Abstract


This thesis investigates how the sub-structure of words can be accounted for in
probabilistic models of language. Such models play an important role in natural language
processing tasks such as translation or speech recognition, but often rely on the
simplistic assumption that words are opaque symbols. This assumption does not fit
morphologically complex languages well, where words can have rich internal
structure and sub-word elements are shared across distinct word forms.

Our approach is to encode basic notions of morphology into the assumptions of 
three different types of language models, with the intention that leveraging shared 
sub-word structure can improve model performance and help overcome data sparsity 
that arises from morphological processes. 

In the context of n-gram language modelling, we formulate a new Bayesian
model that relies on the decomposition of compound words to attain better
smoothing, and we develop a new distributed language model that learns vector
representations of morphemes and leverages them to link together morphologically related
words. In both cases, we show that accounting for word sub-structure improves the
models’ intrinsic performance and provides benefits when applied to other tasks,
including machine translation.

We then shift the focus beyond the modelling of word sequences and consider
models that automatically learn what the sub-word elements of a given language are,
given an unannotated list of words. We formulate a novel model that can learn
discontiguous morphemes in addition to the more conventional contiguous morphemes
that most previous models are limited to. This approach is demonstrated on Semitic
languages, and we find that modelling discontiguous sub-word structures leads to
improvements in the task of segmenting words into their contiguous morphemes.
Chapter 1
Introduction 


People can understand sentences and words they have never heard before. We do this 
by combining the interpretations of parts of the sentence, such as individual words or 
phrases, or parts of words, into an interpretation of the whole. Two key parts of this
compositional view of language are that the smaller meaningful entities are easily
identifiable and that their meanings are likely already known. To illustrate this,
consider how immediately the following sentence is understandable to an English 
speaker: 

The king finally abdicated after years of unkingly conduct. 

This sentence is understandable even though one is unlikely to have previously
encountered the word unkingly. So not only is the sentence interpretable as a
function of the words it contains, but the word unkingly is itself interpretable as a 
function of the morphemes it contains—one can compose the meaning of king with 
the -ly suffix to derive the adjective, while the prefix un- supplies the negation. 

In the field of natural language processing (NLP), a general approach is also to 
decompose problems as the manipulation of elements smaller than whole sentences. 
We thus face the task of determining what the elementary units of language are that 
need to be represented. A simplistic solution is to take punctuation and spaces as 
demarcating the boundaries of the elementary tokens we need to compute with, and 
it is common to regard the resulting word tokens as opaque symbols without
sub-structure.
One measure for the suitability of such tokenisation is how readily an NLP
system would encounter a token not observed in the corpus of text it was trained on.
That is, how much coverage does a corpus provide relative to the hypothetical set of
unique tokens forming the vocabulary of the language under consideration? Aside
from phenomena such as slang, technical jargon and proper nouns, coverage
depends heavily on what semantic and grammatical information a given language
encodes into individual word forms. Languages that are closer to the analytic end of
the linguistic typology spectrum tend to construct word forms from single or very 
few morphemes. At the other extreme are synthetic languages, which combine many 
morphemes into a single word form, either by chaining them together (agglutinative) 
or fusing the information of multiple conceptual morphemes into a single affix. 

English is a suitable instance of a language that is relatively analytic, as it uses a 
fairly simple system of inflection and conjugation. Grammatical relations are largely 
encoded by dedicated function words (e.g. the and of) and syntactic structure. Few 
grammatical distinctions are marked directly on words—for example, English nouns 
are inflected only for their grammatical number, with the plural typically indicated 
by an -s suffix. In more analytic languages the assumption of opaque symbols is
therefore quite reasonable, although, as the derivational example of unkingly shows,
this assumption is an oversimplification even for morphologically ‘simple’ English.

The same assumption is not well-suited to more synthetic languages. Czech,
for example, marks two number and seven case distinctions on nouns using a single
fused suffix, giving rise to ten different inflectional variants of the word král (‘king’),
including krále, králi, králem, králové, etc.¹ Or in the case of strongly agglutinative
languages, it is possible to create new word forms by the repeated addition of
suffixes, theoretically without bound. Turkish is a classic example of this:²

ev (‘house’)

eviniz (‘your (pl.) house’)

evinizdeyim (‘I am at your house’)

evinizdeymişim (‘I was apparently at your house’)


¹ From http://en.wiktionary.org/wiki/král, accessed 15/04/2014.

² From http://en.wikipedia.org/wiki/Turkish_language, accessed 15/04/2014, verified with
the analyser of Çöltekin (2010).
Certain Germanic languages follow a similar regime in the formation of compound
nouns, as in the German example:

Regen (‘rain’) 

Regenschirm (‘umbrella’) 

Regenschirmhersteller (‘umbrella manufacturer’) 

Regenschirmherstellergewerkschaft (‘umbrella manufacturers trade union’) 

To assume that the words in these examples are atomic symbols ignores the very
apparent ways in which they share sub-structure. A computational system that ignores
sub-word structure is likely to obtain poor coverage of the vocabulary of a language 
from a corpus, since morphological processes readily create forms that are very rare 
or non-existent in a given corpus. 

A central theme in this dissertation is therefore to develop methods that account 
more explicitly for the observed grammatical and semantic links between word 
forms that share morphemes in their orthographic representations. The notion of 
training language processing systems from example corpora provides a bridge to the 
other defining element of the dissertation, which we turn to next. 

A computational system that processes language in a useful way is tasked with 
more than merely representing the right symbols. Approaches to the larger question 
of casting the rules and idiosyncrasies of natural languages in algorithmic form can 
be characterised based on two extremes. At the one extreme lie knowledge-intensive 
approaches that explicitly specify the symbols and rules relevant to a language
processing task, e.g. morphological analysis or translation, according to linguistic
theories or hand-crafted heuristics. This can yield systems that work predictably and
to a high standard, especially when the application domain is very limited. But this
approach has various downsides, foremost of which is that it requires substantial
additional labour to transfer a solution to new domains or other languages. Moreover,
a reliance on dictionaries and prescriptive rules does not acknowledge the dynamic 
nature of human language, where new words and expressions arise continually. 

At the other extreme are approaches that try to equip computer systems with a 
learning ability, rather than supplying them directly with linguistic abilities. The 
premise is that an appropriate combination of learning algorithm and data can
produce a level of language processing that is useful. In many instances, we can assume
that examples of language use (data) arise faster and more organically than
linguistic expertise or linguists, which speaks in favour of approaches that rely on data.
This notion has been a driving force in the success and prominence of statistical
approaches to machine translation and speech recognition over the past two decades.

Model-based, data-driven approaches overcome some of the disadvantages of 
knowledge-intensive methods. In principle, they scale more readily across different 
languages and domains if adequate data is available. This latter condition is key. 
To clarify it further, the performance of NLP systems in this general category is 
a function of both model assumptions and data. The transferability, however, also
makes it tempting to accept that a given set of model assumptions applies sufficiently
well to multiple languages, which is not in general the case.

A particular case where a one-size-fits-all scheme is problematic is that of
variations in morphological complexity. Many statistical models of language adopt the
token-based view that we argued is ill-suited for morphologically rich languages.
The fundamental problem is that, at the level of word tokens, many legitimate word
forms occur very rarely, or do not occur at all, in a training corpus. This data sparsity
makes vocabulary coverage poor and hampers the robustness of parameter estimates,
both of which constrain the performance of a model when processing new data.

This dissertation argues in favour of integrating high-level intuitions about
morphology into the assumptions of statistical language models in order to address
the aforementioned challenges posed by token-based processing of morphologically
rich languages. The guiding principle is that morphologically related words should
share statistical strength: for example, separate observations of the words king and
unlikely should trigger the ability of a model to produce an informed response to
an unknown word like unkingly, and instances of králi and králové should inform
system behaviour towards králem.
1.1 Problem statement


The preceding section framed this dissertation in terms that are meant to be
understandable to a broader audience, which necessarily introduces some imprecision. In
this section we now state and motivate the thesis in more precise terms.

Our thesis is that basic linguistic intuitions about how words are formed in
morphologically rich languages can be exploited in the formulation of probabilistic
models that perform better at their tasks. Probabilistic modelling offers a coherent
mathematical setting in which to express assumptions. It allows difficult tasks to be
reformulated accurately in terms of simpler tasks, e.g. “What is the probability of
observing this sentence?” can change to the more approachable “What is the
probability of observing each word in the sentence?”. Similarly, probabilistic models
can be extended and combined in a principled way, limiting the number of dials that
have to be set manually in heuristic approaches.

We deal with language modelling tasks of two kinds. The first is the narrow 
sense that a language model (LM) assigns probabilities to word sequences. This is 
the sense in which the term language model is widely understood in the literature, 
and which will henceforth be implied in this dissertation unless otherwise stated. 
A token-centric view in language models is at odds with morphologically rich
languages, where sub-word elements are correlated in meaningful ways across
different words. By accounting for such correlations in the design of LMs, making use
of morphological information that we assume is given, we show in this dissertation
how the sparsity in the training data of morphologically rich languages can be
mitigated to improve the predictive performance of LMs. Predictive performance is quantified
primarily in two ways. The first is to measure an LM’s intrinsic ability to predict
word sequences in held-out test data, which we do in terms of the perplexity metric.
The second is to consider what effect our morphology-targeting LM extensions have
when situated in a machine translation system. In this context, LMs play a crucial
role in biasing system output toward translations that are more fluent renditions of
the target language. LMs that encode patterns at the sub-word level are by definition
better suited for this role when translation is into morphologically rich languages.
We thus measure translation quality in such settings by comparing system output to
reference translations, using the automated quality metric BLEU.

The second type of modelling task we consider is that of inducing
meaningful sub-word structures from raw text or word lists. Whereas the previously
outlined task is primarily one of sequence-based density estimation, this second sense
of modelling natural language focuses on the unsupervised learning of word
structure, independent of actual word usage in sentences. A dominant approach in this
task is to model words as sequences of morphemes, with the goal of learning what
those morphemes are and how to segment a given word accordingly. This type of
language modelling (in the broad sense) is of practical interest in various other
sub-problems in NLP, including part-of-speech tagging, syntactic parsing, or indeed, for
integrating sub-word level information into LMs, as we do in the first part of this
dissertation. The idea that basic linguistic intuitions can inform and improve the
design of a model finds expression in morphology learning by observing that some
morphological processes involve discontiguous sub-word elements. We exploit this
to devise a model that goes beyond linear segmentation to capture elements of
non-concatenative morphology as well. As an example, in Arabic, the words kitab and
kutub share the same root k-t-b. We show that modelling such discontiguous
morphemes can improve the segmentational component of a model, and that it can
identify part of a lexicon of such morphemes given merely an unannotated list of words.

1.2 Dissertation outline 

The dissertation is structured as follows. 

Chapter 2 provides further background central to understanding the rest of the
material, and motivates the technical approaches followed.

Chapters 3 and 4 document two different approaches to integrating
morphological information into probabilistic LMs. Chapter 3 introduces a hierarchical Bayesian
model that accounts for the productive compound word formation attested in certain
Germanic languages. Chapter 4 introduces a method for integrating
morphological information into probabilistic LMs based on distributed feature representations.
The method is demonstrated to be effective across a range of languages featuring
different levels of morphological complexity.

In Chapter 5 the focus shifts to language modelling in the broader sense, and we
present an unsupervised approach to learning contiguous and discontiguous
morphemes from raw text. The approach is applied primarily to Semitic languages,
which make use of both kinds of morphemes.

Finally, Chapter 6 draws together the work of the preceding three chapters to
summarise our main findings and present possible avenues for further research.

The remainder of this chapter highlights the specific research contributions
presented in this dissertation.


1.3 Main research contributions 


• New generative model of productive compounding

We present a new language model where the formation of compound words
in terms of their theoretically unbounded number of components is part of the
generative process. This is done by extending the Hierarchical Pitman-Yor
Language Model with an additional back-off level where compound heads are
conditioned on the sentential context, a decision based on grammatical
considerations, while compound modifiers are generated by an additional language
model. The effect is to reduce data sparsity due to compounding, and it results
in higher-precision output of compounds from a translation system using
our language model. This work was originally presented at the 2012 EACL
student workshop (Botha, 2012) and published in the proceedings of COLING
(Botha et al., 2012).


• Unsupervised learning of distributed morpheme representations 

We present a simple technique for learning distributed feature representations 
for morphemes (given an external source of morphological segmentation) as 
part of a distributed language model. We show that the resulting morpheme
vectors can be composed through addition to form word vectors that perform
well in replicating human judgements of the semantic similarity of pairs of
words. The benefit lies in being able to go beyond the vocabulary of the
original source of the word vectors.


• Distributed language modelling for rich morphology 

The model within which the aforementioned morpheme representations are 
obtained is a novel variant of the log bilinear language model. Its key property 
is to provide a soft tying of parameters for words that share morphological 
content. We show that this is particularly beneficial for modelling rare words. 


• Integration of normalised distributed language model into a machine
translation decoder

We provide the first demonstration, to our knowledge, of integrating a
normalised distributed language model into a machine translation decoder. This
and the previous two contributions listed here form part of work accepted for
presentation and publication at ICML 2014 (Botha and Blunsom, 2014).


• Joint representation of contiguous and discontiguous morphemes 

We formulate a novel approach to the problem of representing concatenative
and non-concatenative morphology in a unified fashion. This is done by
applying mildly context-sensitive simple Range Concatenating Grammars
(SRCGs), and we illustrate the approach on Semitic morphology. This and the
next contribution listed here were originally reported in the proceedings of
EMNLP 2013 (Botha and Blunsom, 2013).


• Formulation of mildly context-sensitive adaptor grammars

In order to do unsupervised learning within our SRCG-based approach to
modelling morphology, we formulate an extension of adaptor grammars to
the SRCG formalism. This provides an additional method for inducing SRCGs
and their equivalents, which could be useful in applications beyond
morphology.
Chapter 2
Background 


Chapter Abstract 

This chapter introduces background material that the rest of the
dissertation builds on. We formulate the key aspects of sequence-based
statistical language models (LMs) and their evaluation. We present two
alternative perspectives on language modelling by introducing back-off
n-gram models and distributed language models. This sets up a short
discussion of some existing methods for integrating sub-word level
information into those LMs, and what their limitations are. The latter
half of the chapter introduces aspects of Bayesian probability theory and
non-parametric modelling that are relevant in subsequent chapters.
Although most terminology and notation will be introduced when needed,
we begin this chapter with some preliminaries.


2.1 Preliminaries 

This dissertation relies heavily on probability theory, especially as defined for
discrete events. A discrete random variable $X$ can take on any value from an event
space $\Omega = \{x_1, x_2, \ldots\}$, which encodes the potential outcomes of a stochastic
operation or process. A simple example is that of tossing a coin, where
$\Omega = \{\text{heads}, \text{tails}\}$. In general, $\Omega$ can be a countably infinite set.

A probability mass function $P : \Omega \to [0,1]$ maps a particular value $x_i$ of $X$ to
the probability that the event $X = x_i$ occurs, namely $P(X = x_i)$. We will
sometimes abbreviate this as $P(x_i)$, taking the random variable assignment as implicit.
This function characterises the distribution of the random variable, and accordingly
we usually refer simply to a “probability distribution”. The event space is also
referred to as the support of the distribution.


2.2 Statistical language modelling 


The canonical task in statistical language modelling is to assign a probability to a 
sequence of words, e.g. 

P(the sky is grey) = ?

Such probabilities are inferred by using statistical estimates from data; in other
words, the model is trained on a corpus of text in the language of interest.

LMs play a crucial role in applications like speech recognition, machine
translation and predictive text input, where an intended utterance must be decoded from an
ambiguous signal. LMs that better capture the regularities and idiosyncrasies of
natural language are better at discriminating among alternative output candidates that
arise in these applications. They are not directly concerned with the input signal 
(audio features, a foreign language, or a sequence of keystrokes), but instead reward 
output that is a more fluent and coherent rendition of the target language. 

The fundamental challenge in assigning probabilities to word sequences is to
contend with the fact that the set of possible word sequences is infinite, while any
given data set used for estimation is finite. The thesis investigated here is premised
on the observation that morphological processes have a strong effect on the sparsity
of training data. Broadly speaking, more complex morphological processes
exacerbate data sparsity. The overall approach of modelling sub-word elements is therefore
intended to help overcome data sparsity. The motivation is that this is likely a more
viable long-term solution than estimating morphologically blind models from ever
more data, effective as that may be for particular languages in the short term (Brants
et al., 2007).
The dominant strategy¹ in LM estimation is to decompose the probability $P(\mathbf{w})$
of a word sequence $\mathbf{w} = w_1 \ldots w_L$, using the chain rule, as

$$P(\mathbf{w}) = \prod_{i=1}^{L} P(w_i \mid w_1, \ldots, w_{i-1}) . \qquad (2.1)$$

The task of estimating probabilities for entire word sequences of arbitrary length is
thereby simplified into one of estimating probabilities for individual words,
conditioned on all the preceding words in the sequence. An approximation that further
simplifies this is based on the assumption that a particular word only depends
directly on the ones immediately preceding it. A word sequence can accordingly be
modelled as a Markov chain of order $n - 1$:

$$P(\mathbf{w}) \approx \prod_{i=1}^{L} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) . \qquad (2.2)$$


The sequence of $n$ words $(w_{i-n+1}, \ldots, w_i)$ is commonly referred to as an n-gram,
with history $h = (w_{i-n+1}, \ldots, w_{i-1})$. We use the short-hand notation
$w_i^j = (w_i, \ldots, w_j)$ for a word sequence. In alternative terminology, $h$ is the context
from which the target word $w_i$ must be predicted. The support of the conditional
distribution $P(w_i \mid w_{i-n+1}^{i-1})$ is the vocabulary $V$, the discrete set of word types to be
modelled.

Sequence boundaries require some special handling in Markov chains.
Common practice is to assume that the conditioning context of the first token $w_1$ in a
sequence is provided by a special padding symbol, while the use of a designated
symbol marking the end of the sequence allows for coherent modelling of
variable-length sequences.
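
As a rough illustration of Equation 2.2 and the boundary handling just described, the following Python sketch estimates unsmoothed n-gram probabilities by relative frequency and multiplies them along a padded sequence. The symbols `<s>` and `</s>`, the function names and the toy corpus are assumptions made for this example and are not taken from the dissertation.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical padding and end-of-sequence symbols


def pad(sentence, n):
    """Prepend n-1 padding symbols and append one end-of-sequence symbol."""
    return [BOS] * (n - 1) + list(sentence) + [EOS]


def train_mle(sentences, n):
    """Collect n-gram and history counts for relative-frequency estimates."""
    ngram_counts, history_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = pad(sentence, n)
        for i in range(n - 1, len(tokens)):
            history = tuple(tokens[i - n + 1:i])
            ngram_counts[history + (tokens[i],)] += 1
            history_counts[history] += 1
    return ngram_counts, history_counts


def sequence_probability(sentence, n, ngram_counts, history_counts):
    """Product of P(w_i | w_{i-n+1}^{i-1}) over the padded sequence."""
    tokens = pad(sentence, n)
    prob = 1.0
    for i in range(n - 1, len(tokens)):
        history = tuple(tokens[i - n + 1:i])
        if history_counts[history] == 0:
            return 0.0  # unseen history: estimate undefined, treated as zero here
        prob *= ngram_counts[history + (tokens[i],)] / history_counts[history]
    return prob


corpus = [["the", "sky", "is", "grey"], ["the", "sky", "is", "blue"]]
ngrams, histories = train_mle(corpus, n=2)
p_grey = sequence_probability(["the", "sky", "is", "grey"], 2, ngrams, histories)
p_unseen = sequence_probability(["the", "grey", "sky"], 2, ngrams, histories)
```

Under these assumptions, the unseen bigram in the second query drives the whole product to zero, which is the sparsity problem that smoothing methods are designed to address.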

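We often need to compute the probability of a set of word sequences, such as
sentences in a corpus. If $\{\mathbf{w}_1, \ldots, \mathbf{w}_N\}$ are $N$ independent word sequences, then
the joint probability (using the aforementioned Markov assumption) is

$$P(\mathbf{w}_1, \ldots, \mathbf{w}_N) = \prod_{j=1}^{N} P(\mathbf{w}_j) = \prod_{j=1}^{N} \prod_{i=1}^{|\mathbf{w}_j|} P(w_{j,i} \mid w_{j,i-n+1}^{j,i-1}) ,$$

where $w_{j,i}$ denotes the $i$-th token of sequence $\mathbf{w}_j$.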

¹ Other strategies also exist, e.g. the whole-sentence model of Rosenfeld et al. (2001).
In subsequent formulations we refer to a corpus simply as one sequence of tokens,
$w_1^N$, taking these mechanics of boundary symbols and sentence independence as
being implied.

The reliance on the conditional probability distributions $P(w_i \mid w_{i-n+1}^{i-1})$ makes
parameter estimation more tractable and is advantageous for the dynamic programming
employed by machine translation (MT) decoders. We consider language model
estimation in §2.2.2. For now, we assume an estimated model $P_{\mathrm{LM}}$ is given and consider
how its quality can be evaluated.


2.2.1 Language model evaluation 

A common evaluation method is to measure how well the language model $P_{\mathrm{LM}}$
predicts previously unseen test data $w_1^N$. Perplexity is a useful metric for this purpose,
as it is well-defined for any length $N$ and allows for a fair comparison of models in
spite of differences in their assumptions or internal workings. The perplexity of a
model $P_{\mathrm{LM}}$ on the test sequence $w_1^N$ is

$$\mathrm{ppl} = 2^{H(P_{\mathrm{LM}})} , \qquad (2.3)$$

where $H$ is the cross-entropy

$$H(P_{\mathrm{LM}}) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P_{\mathrm{LM}}(w_i \mid w_1^{i-1}) . \qquad (2.4)$$

The intuitive interpretation of a perplexity value of $k$ is that the uncertainty of the
model with respect to the test data is equivalent to the uncertainty in the outcome
of throwing a $k$-sided fair die. A model is thus on average $k$-ways perplexed by the
observation of each test word. Lower perplexity corresponds to lower uncertainty,
which is the mark of a better model.² Note that perplexity, as defined above, is
independent of model assumptions such as the Markov assumption applied in the
formulation of n-gram models.
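
The definition can be made concrete with a short Python sketch. It assumes that the per-word probabilities assigned by some model to the test sequence are already available as a list; the function name and the example values are illustrative only.

```python
import math


def perplexity(word_probs):
    """Return 2**H, where H is the average negative log2 probability per word."""
    n = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2.0 ** cross_entropy


# A model that spreads its mass like a fair six-sided die on every word.
uniform = perplexity([1 / 6] * 10)

# A model that is more confident about the same number of test words.
sharper = perplexity([0.5, 0.25, 0.5, 0.25])
```

In this toy setting the first model has a perplexity of six, matching the die interpretation, while the second model obtains a lower value because it assigns higher probability to each observed word.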

A comparison of two models $P_{\mathrm{LM}_1}$ and $P_{\mathrm{LM}_2}$ in terms of their perplexity on a
given data set is only fair if the support of the probability distributions is equivalent,
in other words, if they model the same vocabulary $V$. This matter requires special
attention when going beyond the standard LM assumption that $V$ is a finite set, as
we do in Chapter 3.

² For a basic mathematical derivation of perplexity in terms of entropy, see Koehn
(2010, ch. 7).


2.2.2 Two dominant approaches to Markov-based LMs 

The focus now turns to two dominant approaches to modelling the categorical
conditional distributions $P(w_i \mid w_{i-n+1}^{i-1})$ used in Markovian language models.

1) One approach is to infer these distributions by using exact matching of n-grams,
thereby directly representing the high-dimensional discrete space of n-grams. For
historical reasons, this class of models is referred to simply as n-gram language
models.

2) The other major approach is to embed words in a low-dimensional, continuous
vector space and learn a function that uses the resulting word vectors to map an n-gram
history into a probability for the target word $w_i$. These distributed language models
are also known by the terms neural, connectionist or continuous-space language
models.

We instantiate the thesis that basic linguistic intuitions about word formation 
are beneficial in modelling morphologically rich languages within each of these two 
approaches. The next two sections present them in more detail. 


2.2.2.1 n-gram language models 


A discrete n-gram LM is in fact a collection of conditional multinomial distributions
like $P(w_i \mid w_{i-n+1}^{i-1})$, estimated directly from the empirical distribution of n-grams
in training data. Maximum likelihood estimates suffer from a variety of problems: some
n-grams are not observed in the training data yet a model must still be able to score
them, while n-grams that do appear in the data may do so with a low frequency that is
not representative of their true distribution in the language (Good, 1953). To address
these and other more subtle challenges arising from data sparsity, a wide variety
of techniques have been devised to smooth the empirical n-gram distributions by
redistributing probability mass (Chen and Goodman, 1998).
The interpolated modified Kneser-Ney (MKN) model (Kneser and Ney, 1995;
Chen and Goodman, 1999) is considered the state-of-the-art discrete n-gram LM
(Chelba et al., 2014). It serves as a baseline model in the empirical evaluations
reported later in this dissertation. We briefly introduce key smoothing concepts it
relies on, some of which have analogues in the models we subsequently develop.

The first concept is that of basing probability estimates of a certain random event
on more general random events. Following on from the initial example in this
chapter, we can regard the observation of the word grey in the context sky is as a more
specific event than observing it in the shorter context is. That is, the bigram is
grey is a generalisation of the trigram sky is grey. Thus to overcome the data
sparsity inherent in estimating n-gram LM parameters, an effective technique is to let
$P_{\mathrm{LM}}(\text{grey} \mid \text{sky}, \text{is})$ be a function of $P_{\mathrm{LM}}(\text{grey} \mid \text{is})$.

More formally, defining $P_{\mathrm{LM}}(w_i \mid w_{i-n+1}^{i-1})$ as a function of lower-order
conditional distributions $P_{\mathrm{LM}}(w_i \mid w_{i-n+2}^{i-1})$ can be done through interpolation
(Jelinek and Mercer, 1980) or by strictly switching to the lower-order model (Katz, 1987).
Interpolation typically works better (Chen and Goodman, 1998), so we restrict the
attention to its functional form:

$$P_{\mathrm{LM}}(w_i \mid w_{i-n+1}^{i-1}) = \alpha(w_{i-n+1}^{i}) + \gamma(w_{i-n+1}^{i-1}) \, P_{\mathrm{LM}}(w_i \mid w_{i-n+2}^{i-1}) , \qquad (2.5)$$

where $\alpha(w_{i-n+1}^{i})$ denotes a smoothed version of the n-gram conditional
distribution (see below), while the $\gamma$ parameters are set to normalise the overall probability
distribution, $P_{\mathrm{LM}}(w_i \mid w_{i-n+1}^{i-1})$. This procedure can be followed recursively and
bottoms out with a reliance on the unigram model $P_{\mathrm{LM}}(w_i)$.

The matter of defining $\alpha(\cdot)$ provides the connection to the second important
smoothing concept of discounting. Instead of relying directly on the actual number
of times an n-gram is observed in the training data, $C(w_i^j)$, estimation is done using
counts that are adjusted down, to give a discounted frequency $C_{\mathrm{disc}}(w_i^j) < C(w_i^j)$.
MKN uses absolute discounting (Ney et al., 1994) to adjust an n-gram count by a
constant $D_m$,

$$C_{\mathrm{MKN}}(w_{i-n+1}^{i}) = \max\left(C(w_{i-n+1}^{i}) - D_m, 0\right) , \qquad (2.6)$$

where $m \in \{1, 2, 3+\}$ designates one of three discount levels according to whether
the n-gram occurs once, twice, or more than twice.

MKN combines discounting and (interpolated) back-off in a particular way. $\alpha(\cdot)$
is the discounted maximum likelihood estimate

$$\alpha(w_{i-n+1}^{i}) = \frac{C_{\mathrm{MKN}}(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})} . \qquad (2.7)$$
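
To make the interplay of Equations 2.5 and 2.7 concrete, the sketch below implements interpolated absolute discounting for a bigram model in Python. It is a deliberate simplification and not MKN itself: it assumes a single discount of 0.75 instead of the three levels $D_m$, backs off to a plain relative-frequency unigram distribution, and sets the normaliser to $\gamma(h) = D \cdot N_{1+}(h) / C(h)$, where $N_{1+}(h)$ is the number of distinct words observed after history $h$. All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Gather the counts needed for interpolated absolute discounting."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    history_totals = Counter(tokens[:-1])  # C(h): occurrences of h as a history
    followers = defaultdict(set)           # distinct words seen after each h
    for h, w in bigrams:
        followers[h].add(w)
    return unigrams, bigrams, history_totals, followers


def prob(w, h, model, discount=0.75):
    """P(w | h) = alpha(h, w) + gamma(h) * P_unigram(w)."""
    unigrams, bigrams, history_totals, followers = model
    p_uni = unigrams[w] / sum(unigrams.values())
    total = history_totals[h]
    if total == 0:
        return p_uni  # unseen history: rely on the lower-order model alone
    alpha = max(bigrams[(h, w)] - discount, 0.0) / total
    gamma = discount * len(followers[h]) / total  # mass freed by discounting
    return alpha + gamma * p_uni


tokens = "the sky is grey the sky is blue the sea is grey".split()
model = train_bigram(tokens)
vocab = set(tokens)
total_mass = sum(prob(w, "is", model) for w in vocab)
p_seen = prob("grey", "is", model)
p_unseen = prob("sea", "is", model)
```

With this choice of $\gamma$, the probability mass removed by discounting is exactly what is redistributed through the lower-order model, so the conditional distribution still sums to one while unseen bigrams such as "is sea" receive non-zero probability.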


The remaining element for using Equation 2.5 is to specify the back-off distribution
$P_{\mathrm{LM}}(w_i \mid w_{i-n+2}^{i-1})$. Kneser-Ney smoothing uses purpose-built lower-order
distributions that exploit the intuition that the diversity of contexts a word appears in often
represents a more accurate view of its true probability than the n-gram count itself.
For example, the empirical prevalence of the n-gram Los Angeles overstates the true
probability of the word Angeles, so Kneser-Ney smoothing bases the estimate instead on
the number of unique n-gram histories the word appears in.
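
The continuation-count idea can be sketched in a few lines of Python. The example assumes a bigram setting, so that a history is a single preceding word, and uses an invented toy corpus; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def continuation_unigram(tokens):
    """Unigram distribution proportional to the number of distinct left contexts."""
    left_contexts = defaultdict(set)
    for prev, word in zip(tokens, tokens[1:]):
        left_contexts[word].add(prev)
    total_types = sum(len(contexts) for contexts in left_contexts.values())
    return {w: len(contexts) / total_types for w, contexts in left_contexts.items()}


tokens = ("los angeles is big . los angeles is far . "
          "the city is big . a city is far . that city is old .").split()
frequency = Counter(tokens)
p_cont = continuation_unigram(tokens)
```

Here "angeles" only ever follows "los", whereas "city" follows three different words, so the continuation-based estimate for "city" is three times that of "angeles" even though their raw frequencies (three and two) are much closer.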

n-gram language models, and MKN smoothing in particular, have been relied
on heavily in the domain of machine translation in the last 15 years, and have more
recently received significant amounts of engineering attention in order to scale to
vast training data sizes (Heafield, 2011; Heafield et al., 2013). A main limitation of
MKN is that it is fundamentally heuristic-based, albeit heuristics derived from rigorous
empirical work. This makes it challenging to incorporate new smoothing ideas.


Teh (2006a) proposed a Bayesian approach to n-gram language modelling which
implements some of the same principles that MKN depends on, but offers a framework
that better lends itself to modifications of the basic smoothing behaviour. In
Chapter 3 we use that framework to create a more sophisticated smoothing method
which exploits simple linguistic intuitions about word formation.


2.2.2 Distributed language models

A salient feature of the n-gram models in the previous section is that they parametrise
a high-dimensional discrete space directly, based on the identities of words. This
amounts to a harsh binary perspective on the way words are correlated in text by
requiring that a test n-gram (or its backed-off version) exactly match an event in the
training data.

An alternative approach that acknowledges a more multifaceted notion of correlation
among word occurrences is to represent words in a multidimensional vector
space. Thus instead of each word type $v$ in the vocabulary $V$ being merely a symbol
in a discrete set, each $v$ has an associated real-valued distributed feature vector
$r_v \in \mathbb{R}^d$. This immediately enables reasoning about the relations among words in
terms of how they relate in vector space, with the intuition that “similar” words are
closer to each other, using a metric such as the angle or Euclidean distance between
their representation vectors. As an example, the words of the vocabulary fragment
$V$ = {blue, clouds, grey, skies, sky, ...} could have feature representations that look
as follows when plotted in two-dimensional space:


[Figure: two-dimensional plot of the example word vectors, with axis label $x_2$.]



In general, the different dimensions have no intrinsic meaning, although this may
depend on the method used for obtaining distributed representations. For example,
in the area of distributional semantics, vector representations of words have been
constructed directly from corpus-based co-occurrence statistics, using the most frequent
words as basis vectors (Bullinaria and Levy, 2007; Mitchell and Lapata, 2008,
inter alia). However, we consider models where the word vectors are learnt as part
of LM training, and the different dimensions simply capture different correlates.

Distributed language models (DLMs) apply distributed feature representations in
assigning probabilities to word sequences. The general idea can be stated as follows
in terms of n-gram modelling: An n-gram history $w_{i-n+1}^{i-1}$ is mapped to a vector
$x \in \mathbb{R}^{d'}$ as some function $f(\cdot)$ of its word vectors, i.e.
$x = f(r_{w_{i-n+1}}, \dots, r_{w_{i-1}})$, where $d'$ is free to differ from $d$. A further
function $g(\cdot)$ then transforms $x$ into a conditional probability:
$P_{LM}(w_i \mid w_{i-n+1}^{i-1}) = g(w_i, x)$. A variety of neural network
architectures and related methods have been used to model the functions $f$ and $g$
while jointly learning the representation vectors (Xu and Rudnicky, 2000; Bengio
et al., 2003; Morin and Bengio, 2005; Mnih and Hinton, 2007; Mikolov et al., 2010).
The formulation above is a simplification of the neural probabilistic language
model (Bengio et al., 2003), but is sufficient to highlight the key advantages of DLMs
over the discrete n-gram LMs of the previous section. The main benefit is that, by
making the conditional probabilities smooth functions of the n-gram context,
generalisation to new n-grams is more automatic. It becomes possible for two lexically
distinct phrases such as the grey clouds and the blue sky to effect similar
conditioning on an immediately following word, assuming the word vectors involved
are pair-wise “nearby” in the distributed representation space as demonstrated in
the preceding diagram. This obviates the need for explicit back-off strategies. In
more general terms, DLMs mitigate the sparsity problem associated with the
high-dimensional parameter space of discrete n-gram LMs (which is in principle $|V|^n$)
by operating in a lower dimensional continuous space. In particular, $d \ll |V|$, and
values of 50 to 1000 are typical for $d$, whereas typically $|V| > 10^5$, or much larger
for morphologically rich languages.
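The smoothness argument can be made concrete with a small sketch. It assumes one
particular choice of the two functions, namely concatenation for $f$ and a softmax over
linear scores for $g$; the two-dimensional word vectors and output weights are
hypothetical values picked by hand rather than learnt, and the candidate next words are
illustrative only.

```python
import math


def f_concat(history, vectors):
    """f(.): concatenate the history's word vectors, so d' = (n - 1) * d."""
    x = []
    for word in history:
        x.extend(vectors[word])
    return x


def g_softmax(x, output_weights):
    """g(.): softmax over one linear score per candidate next word."""
    scores = {w: sum(wi * xi for wi, xi in zip(weights, x))
              for w, weights in output_weights.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Hypothetical vectors: colour words lie close together, as do sky words.
word_vectors = {
    "blue": [1.0, 0.2], "grey": [0.9, 0.3],
    "sky": [0.1, 1.0], "clouds": [0.2, 0.9],
}
# Hypothetical output weights for three candidate next words (d' = 4).
output_weights = {
    "today": [1.0, 0.0, 0.0, 1.0],
    "slowly": [0.0, 1.0, 1.0, 0.0],
    "the": [0.5, 0.5, 0.5, 0.5],
}

p_grey_clouds = g_softmax(f_concat(["grey", "clouds"], word_vectors), output_weights)
p_blue_sky = g_softmax(f_concat(["blue", "sky"], word_vectors), output_weights)
largest_gap = max(abs(p_grey_clouds[w] - p_blue_sky[w]) for w in output_weights)
```

Because the two histories map to nearby points $x$, the two predictive distributions
differ only slightly, even though the histories share no words.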

This dissertation addresses two issues in DLMs. The first is to consider how they
can be improved for morphologically rich languages by using the internal structure
of words, and the second is to demonstrate the full integration of DLMs into a
machine translation system (Chapter 4).


2.2.3 Language modelling at the sub-word level 

Both approaches to language modelling presented above usually consider words as
the fundamental modelling unit. They seek to mitigate data sparsity in different
ways, but do not directly address a major source of that data sparsity, which is that
of complex morphology. Morphological complexity comes in many guises, but the
common outcome is that languages with more complicated rules for forming words
tend to have larger vocabularies of surface word forms. This exacerbates the data
sparsity to which n-gram LMs are prone by definition, while DLMs are also
stretched by having fewer training instances of individual word forms from which
to learn suitable representation vectors and the accompanying probability function.

A further implication of rich morphology is that test data can be expected to
contain relatively many word forms not appearing in the training vocabulary.³ The
standard practice for dealing with this out-of-vocabulary (OOV) problem is to set
the model up in a way that reserves some probability mass for a generic
“unknown-word” symbol, which we will denote as Unk. This has the shortcoming of treating
truly novel words, such as proper nouns, the same as previously unseen morphological
variants of a known word, ignoring the obvious relation among them.

Both these problems can be addressed by shifting the fundamental unit of
representation to be at the sub-word level. The sought-after effect is that morphologically
related word forms should share statistical strength in some way and that some
OOVs can be handled in terms of their sub-word structure.

Several approaches to integrating morphological information into language models
have been proposed.⁴

³ Mismatches in the domains of text are another major source of out-of-vocabulary words in test
data.

⁴ The focus here is on the types of language models introduced in §2.2.2, but we note that
maximum-entropy language models also offer a useful way of integrating morphological
information, e.g. (Minkov et al., 2007).

A naive strategy is to segment words into their morphemes in a preprocessing
step, and then construct otherwise unmodified language models from the modified
tokenisation. As an example of decomposition, the sentence
The forecasters predicted more flooding
may be segmented into morphemes to give

The forecaster s predict ed more flood ing
This technique reduces the number of unique token types, which directly addresses
sparsity, but introduces its own problems. It raises the question of how aggressively to
segment, which often opens the door to ad-hoc heuristics that do not transfer reliably
across multiple tasks or languages. It also disrupts the contextual conditioning in
back-off n-gram models by making a hard switch away from modelling sentential
coherence, say via P(predicted | the, forecasters), toward modelling fragments, say
P(ed | s, predict). This strategy has been used in speech recognition (Geutner, 1995;
Ircing et al., 2001; Hirsimäki et al., 2006) and machine translation (Virpioja et al.,
2007).
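The change in conditioning context can be seen in a short sketch. The segmentation
table below is a hypothetical hand-written lookup covering only the example sentence; a
real system would obtain segmentations from a morphological analyser or segmenter.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical segmentation table for the example sentence only.
SEGMENTATION = {
    "forecasters": ["forecaster", "s"],
    "predicted": ["predict", "ed"],
    "flooding": ["flood", "ing"],
}


def segment(tokens, table):
    """Replace each token by its morphemes; unknown tokens are left whole."""
    morphemes = []
    for token in tokens:
        morphemes.extend(table.get(token, [token]))
    return morphemes


words = "the forecasters predicted more flooding".split()
morphemes = segment(words, SEGMENTATION)
word_trigrams = ngrams(words, 3)
morpheme_trigrams = ngrams(morphemes, 3)
```

The word-level trigrams contain (the, forecasters, predicted), whereas after segmentation
the same span is covered by fragments such as (s, predict, ed), so a trigram model no
longer conditions predicted on the preceding words as wholes.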


A more systematic approach is that of factored language models (FLMs) (Bilmes
and Kirchhoff, 2003; Kirchhoff et al., 2008). FLMs cast each word as a bundle of
$K$ features $w_i \equiv (f_i^1, \dots, f_i^K)$ over which a variety of conditional probability
distributions are defined. To clarify the notation, a standard n-gram model trivially
uses word form as its single feature ($K = 1$) and the n-gram conditional distribution
is written as $P(f_i^1 \mid f_{i-n+1}^1, \dots, f_{i-1}^1)$. The intention is to allow for the
integration of information beyond word forms. For example, a word’s part-of-speech and stem
could be used as additional features, $w_i = (f_i^{surface}, f_i^{PoS}, f_i^{stem})$, as in
cloudy = (cloudy, adjective, cloud) ($K = 3$).

This factored representation bears superficial resemblance to the distributed feature
representations introduced in §2.2.2. The main difference is that the FLM
features are discrete random variables over which arbitrary directed graphical models
are defined. Furthermore, feature values are likely set with reference to some
external source and not trained as part of the FLM.

An FLM that goes beyond standard n-gram conditioning might incorporate a sub-model
$P(f_i^{surface} \mid f_{i-1}^{stem})$, which conditions the target word on the stem of the
preceding word. Multiple conditional distributions of this kind can be combined, similar
to the way standard n-gram models rely on lower-order conditionals. But with the
richer information encoded, an important question becomes how to prioritise the
back-off sequence. Bilmes and Kirchhoff (2003) proposed solving this by generalising
back-off as a procedure that happens on a graph (instead of the linear procedure
of dropping context words as in back-off n-gram models). They rely on genetic
algorithms to acquire the structure of the back-off graph over the $K$ feature dimensions
and $n$ sequence positions in an n-gram approach, while the individual conditional
distributions use standard smoothing or discount algorithms.
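To illustrate why the back-off sequence needs prioritising, the sketch below enumerates
all paths through a back-off graph in which each step drops one conditioning variable.
It only shows the shape of the search space; the variable names are hypothetical, and
the sketch does not attempt the genetic-algorithm structure search described above.

```python
from itertools import permutations


def backoff_paths(parents):
    """Every path from the full conditioning set to the empty set,
    dropping exactly one conditioning variable per step."""
    paths = []
    for drop_order in permutations(parents):
        remaining = list(parents)
        path = [tuple(remaining)]
        for dropped in drop_order:
            remaining.remove(dropped)
            path.append(tuple(remaining))
        paths.append(path)
    return paths


# Hypothetical conditioning variables for predicting the surface form of word i.
parents = ("surface[i-1]", "stem[i-1]", "pos[i-1]")
paths = backoff_paths(parents)
graph_nodes = {node for path in paths for node in path}
```

With three conditioning variables there are already six possible paths over eight
nodes, whereas a standard back-off n-gram model follows a single fixed path that drops
the most distant word first.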
The factored language model approach has also been applied to neural probabilistic
language models (Alexandrescu and Kirchhoff, 2006), which, as pointed out
before, remove the necessity to deal with back-off paths. Associations among
different features are learnt as part of the network weights rather than having to be
specified explicitly as present or absent. These factored neural language models
(FNLMs) in effect associate a distributed representation with each feature component
of a word, so the overall word representation $r_w = [r^{(1)}; \dots; r^{(K)}]$ is constructed
by concatenating the feature vectors $r^{(k)} \in \mathbb{R}^{d(k)}$, where $1 \le k \le K$ and the
dimensionality $d(k)$ may vary across components.

FLMs are a suitable vehicle for integrating external information into language
models, including morphology.⁵ They can lower perplexity (Bilmes and Kirchhoff,
2003; Alexandrescu and Kirchhoff, 2006; Wu et al., 2012) and improve speech
recognition accuracy (Vergyri et al., 2004).

The main shortcoming of FLMs is that the number of factors $K$ is a fixed constant
for all words. Therefore they cannot naturally encode variable-length information,
such as the surface-level morphemes of a word. In largely fusional languages,
such as Czech, there is likely no loss in assuming a fixed-length segmentation, say
prefix + stem + suffix.⁶ But a better solution is needed for agglutinative languages or
phenomena where the number of morphemes per word can vary substantially across
the distinct surface words of a given language.

In Chapters 3 and 4 we present two separate language modelling approaches that
might be regarded as variations on the idea of FLMs but that have inherent support
for variable-length decompositions of words.


⁵ As an example of FLMs using non-morphological information, Adel et al. (2013) applied them
successfully to the problem of code-switching by incorporating a language-ID factor for each word.

⁶ Another option is to apply FLMs to morpheme-segmented data, so that each morpheme is
assigned a set of factors (Mousa et al., 2011).


2.2.4 Summary


The preceding section set up some terminology and definitions used in statistical
language modelling. We introduced the two widely used approaches of back-off-based
n-gram LMs and vector-based DLMs. Finally, we discussed an existing general
approach to integrating morphological information into both types of language
models, and pointed out that it does not naturally support strong agglutination.


2.3 Non-parametric Bayesian modelling 


This section introduces the key elements of the mathematical framework employed
in Chapters 3 and 5.

Bayesian modelling centres around two concepts that address the questions of
inferring the parameters $\theta$ of a model given observed data $D$, and making predictions
about a new data point $x$ under the model. First, instead of working with
point-estimates of $\theta$, the Bayesian approach quantifies the uncertainty about the value of $\theta$
by means of probability distributions over its potential values. The prior distribution
$P(\theta \mid \alpha)$ expresses our beliefs about what the value of $\theta$ might be in the absence of
observed training data. The hyperparameters $\alpha$ control the behaviour of the prior
distribution.

The objective, however, is to make inferences in light of observed data. The
posterior distribution $P(\theta \mid D, \alpha)$ quantifies the uncertainty in the values of $\theta$ following
the observation of training data $D$. The prior and posterior distributions are related
by Bayes’ law as

$$P(\theta \mid D, \alpha) \propto P(D \mid \theta) \, P(\theta \mid \alpha), \tag{2.8}$$

where the term $P(D \mid \theta)$ is referred to as the likelihood distribution of the data $D$
conditioned on the model parameters $\theta$. This factorisation is often useful in problem
settings where the terms on the right-hand side of (2.8) are computed more easily
than directly computing the posterior probabilities on the left-hand side.

Our focus falls instead on the modelling convenience afforded by the specification
of a prior distribution. Judicious choice of prior distribution provides a principled
way to encode external knowledge or intuitions relevant to a problem into a
probabilistic model. The effect of the prior on the posterior distribution is strongest
when little or no data has been observed. The latter two properties of Bayesian
modelling make it especially suitable for our aims of overcoming the data sparsity
problem in morphologically rich languages by encoding sub-word information into
models, which we can do through constructing appropriate Bayesian priors.

The second central concept of the Bayesian approach is posterior inference. The
prediction of a new data point $x$ is approached by relying on all possible values
of the parameters $\theta$ rather than a point-estimate. That is, the posterior predictive
distribution $P(x \mid D, \alpha)$ is determined by marginalising the posterior distribution
over $\theta$:

$$P(x \mid D, \alpha) = \int P(x \mid \theta) \, P(\theta \mid D, \alpha) \, d\theta. \tag{2.9}$$

For many types of models, this integral can neither be solved analytically nor computed
tractably. The two major solutions to this impasse are to approximate the posterior
distribution with a simpler function which enables exact numerical optimisation
(Beal, 2003), or to use Markov chain Monte Carlo (MCMC) techniques to approximate
the integral with a finite number of samples drawn from the true distribution
(Neal, 1993; Gilks et al., 1996). Our work follows the latter option.
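The sample-based approximation of Equation 2.9 can be sketched on a toy model where the
answer is known exactly. The example assumes a Beta prior over the parameter of a
Bernoulli likelihood, a conjugate pair chosen purely for illustration and not used
elsewhere in this dissertation. Because the posterior is then available in closed form,
the sketch draws independent samples from it directly; MCMC is needed precisely when
such direct sampling is not possible.

```python
import random


def exact_predictive(ones, zeros, alpha1, alpha2):
    """Closed-form P(x = 1 | D, alpha) for the Beta-Bernoulli model."""
    return (alpha1 + ones) / (alpha1 + alpha2 + ones + zeros)


def monte_carlo_predictive(ones, zeros, alpha1, alpha2, num_samples, rng):
    """Approximate the integral in Equation 2.9 by averaging P(x = 1 | theta)
    over samples of theta drawn from the posterior."""
    total = 0.0
    for _ in range(num_samples):
        theta = rng.betavariate(alpha1 + ones, alpha2 + zeros)
        total += theta  # for a Bernoulli likelihood, P(x = 1 | theta) = theta
    return total / num_samples


rng = random.Random(0)
exact = exact_predictive(ones=7, zeros=3, alpha1=1.0, alpha2=1.0)
approximate = monte_carlo_predictive(7, 3, 1.0, 1.0, num_samples=100_000, rng=rng)
```

The Monte Carlo estimate converges on the exact value as the number of samples grows.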


2.3.1 Pitman-Yor process prior 


A central component of the Bayesian modelling work in this dissertation is the use
of the Pitman-Yor process (PYP; Pitman, 1995; Pitman and Yor, 1997; Ishwaran and
James, 2001) to construct prior distributions that encode sub-word structure. This
follows widespread and successful application of the PYP in NLP tasks including
unsupervised learning of grammars (Cohn et al., 2010; Levenberg et al., 2012;
Cohen et al., 2010), word segmentation (Mochihashi et al., 2009; Neubig et al., 2010),
morphology (Goldwater et al., 2006), parts of speech (Blunsom and Cohn, 2011)
and n-gram language modelling (Teh, 2006a; Wood and Teh, 2009).

The suitability of the PYP in these diverse applications, and indeed in this
dissertation, arises from two properties. The first is that it falls within the class of
non-parametric statistical models (Wasserman, 2007), which means that the number
of model parameters (or more generally, the model complexity) is driven by the
observed data instead of being fixed a priori. Non-parametric modelling matches the
boundlessness of many linguistic objects very well: words, morphemes and syntactic
structures all seem to constitute countably infinite sets. The second key property
of the PYP is that it naturally produces power-law distributions that model the
distribution of word frequencies (Zipf, 1932) and that are useful for other linguistic
elements (Goldwater et al., 2011).

The PYP is a stochastic process that generates distributions over exchangeable
random partitions of a countably infinite set of elements. A PYP is specified by a
base distribution $P_0$, and two hyperparameters: the discount $a$ and strength $b$, with
$0 \le a < 1$ and $b > -a$. We write it as $\mathcal{PY}(a, b, P_0)$. Setting $a = 0$ reduces
the PYP to the more familiar Dirichlet process (Ferguson, 1973) with concentration
parameter $\alpha = b$. The role of the base distribution is considered in the next section,
but we note that it determines the support of a draw $G$ from the PYP, where $G$ is
itself an infinite dimensional probability distribution.

The PYP has various characterisations. An infinite dimensional distribution $G$
sampled from a PYP can be constructed by a stick-breaking procedure (Sethuraman,
1994; Ishwaran and James, 2001). However, in our applications it is more useful to
work with draws from $G$. The Pitman-Yor Chinese Restaurant Process (PYCRP;
Ishwaran and James, 2003) provides a means of obtaining such a draw $y \sim G$ and
circumvents the problem of representing the infinite dimensional object.

The PYCRP can be defined through the metaphor of an unbounded sequence
of customers entering a restaurant one-by-one and being seated at tables with
unbounded seating capacity and of which there is a limitless supply. The table
allocation of customer $x_i$ is recorded as table index $z_i \in \mathbb{Z}^+$. Let $\mathbf{z}_n$ denote the sequence
of table indices after $n$ customers have been seated, and let $n_k$ denote the number
of customers seated at the $k$th of the $m$ non-empty tables in the restaurant. Hence
$n = \sum_{k=1}^{m} n_k$. The first customer $x_1$ sits at the first table by definition, i.e. $z_1 = 1$.
Given $n \ge 1$ previously seated customers, the next customer $x_{n+1}$ is seated at table
index $z_{n+1} \in [1, m + 1]$, sampled from the probability distribution

$$P(z_{n+1} = k \mid \mathbf{z}_n, a, b) =
\begin{cases}
\dfrac{n_k - a}{n + b} & \text{if } 1 \le k \le m \\[1ex]
\dfrac{am + b}{n + b} & \text{if } k = m + 1.
\end{cases} \tag{2.10}$$


The dependence on $n_k$ in the first case means customers are more likely to be
assigned to tables that already have relatively more customers. However, assignment
to a previously unoccupied table (case $k = m + 1$) always carries non-zero
probability, and is itself dependent on the number of occupied tables $m$. The
interaction of this preferential attachment dynamic and the positive attitude towards
novelty induces a power-law distribution over the number of customers $n_k$ at each
table, provided $0 < a < 1$ (Goldwater et al., 2011).
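The seating rule of Equation 2.10 translates directly into a sampler. The following is
a minimal sketch; the function name and the hyperparameter values in the usage lines
are hypothetical.

```python
import random


def sample_seating(num_customers, a, b, rng):
    """Seat customers one by one according to Equation 2.10.

    Returns the 1-based table indices z and the table occupancies n_k.
    """
    table_sizes = []  # n_k for each of the m occupied tables
    z = []
    for _ in range(num_customers):
        m = len(table_sizes)
        if m == 0:
            k = 1  # the first customer sits at the first table by definition
        else:
            # Unnormalised weights; the shared denominator n + b cancels.
            weights = [n_k - a for n_k in table_sizes] + [a * m + b]
            k = rng.choices(range(1, m + 2), weights=weights)[0]
        if k == m + 1:
            table_sizes.append(1)    # open a new table
        else:
            table_sizes[k - 1] += 1  # join an existing table
        z.append(k)
    return z, table_sizes


rng = random.Random(1)
z, table_sizes = sample_seating(1000, a=0.8, b=1.0, rng=rng)
```

Sorting `table_sizes` in decreasing order for different values of `a` gives a quick way
to inspect how the discount affects the tail of the resulting size distribution.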

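The joint probability of a particular seating assignment $\mathbf{z}_n$ under the PYCRP is
obtained as the product of the individual seating decisions:

$$P(\mathbf{z}_n \mid a, b) = P(z_1 \mid a, b) \prod_{i=2}^{n} P(z_i \mid \mathbf{z}_{i-1}, a, b) \tag{2.11}$$

$$= 1 \cdot \frac{(a(1) + b) \cdots (a(m-1) + b) \times \prod_{k=1}^{m} \big((1 - a) \cdots (n_k - 1 - a)\big)}{(1 + b) \cdots (n - 1 + b)} \tag{2.12}$$

$$= \left[\prod_{k=1}^{m-1} (ak + b)\right] \frac{\Gamma(1 + b)}{\Gamma(1 + b)\,(1 + b) \cdots (n - 1 + b)} \times \prod_{k=1}^{m} \frac{\Gamma(1 - a)\,(1 - a) \cdots (n_k - 1 - a)}{\Gamma(1 - a)} \tag{2.13}$$

$$= \frac{\Gamma(1 + b)}{\Gamma(n + b)} \left[\prod_{k=1}^{m-1} (ak + b)\right] \left[\prod_{k=1}^{m} \frac{\Gamma(n_k - a)}{\Gamma(1 - a)}\right] \tag{2.14}$$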


The numerator in (2.12) accounts separately for the initial occupation of the tables
and for the assignment of subsequent customers to each table. The final form is
efficient for computation, making use of the Gamma function which is defined as
$\Gamma(x) = \int_0^{\infty} u^{x-1} e^{-u} \, du$ for $x > 0$, with the identity
$\Gamma(x) = (x - 1)\,\Gamma(x - 1)$ for real-valued $x > 1$.
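As a check on the derivation, the sketch below computes the log joint probability in two
ways: as the running product of seating decisions (Equation 2.11) and through the
closed form with Gamma functions (Equation 2.14). The function names and the example
seating are hypothetical.

```python
import math


def log_joint_sequential(z, a, b):
    """Sum the log-probabilities of the individual seating decisions."""
    table_sizes = []
    log_p = 0.0
    for n, k in enumerate(z):  # n customers are already seated
        m = len(table_sizes)
        if n == 0:
            table_sizes.append(1)  # first customer: probability 1
        elif k == m + 1:
            log_p += math.log(a * m + b) - math.log(n + b)
            table_sizes.append(1)
        else:
            log_p += math.log(table_sizes[k - 1] - a) - math.log(n + b)
            table_sizes[k - 1] += 1
    return log_p, table_sizes


def log_joint_closed_form(table_sizes, a, b):
    """Evaluate Equation 2.14 in log space using the log-Gamma function."""
    n, m = sum(table_sizes), len(table_sizes)
    log_p = math.lgamma(1 + b) - math.lgamma(n + b)
    log_p += sum(math.log(a * k + b) for k in range(1, m))
    log_p += sum(math.lgamma(n_k - a) - math.lgamma(1 - a) for n_k in table_sizes)
    return log_p


example_z = [1, 1, 2, 1, 3, 2]  # 1-based table indices of six customers
sequential, sizes = log_joint_sequential(example_z, a=0.5, b=1.0)
closed_form = log_joint_closed_form(sizes, a=0.5, b=1.0)
```

The closed form depends on the seating only through the table sizes, which reflects the
exchangeability of the process.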
2.3.2 Pitman-Yor process base distributions

A sequence of table assignments $\mathbf{z}_n$ under the PYCRP is fundamentally just a
partition of a sequence of integers, $\{1, \dots, n\}$, into $m$ clusters such that they exhibit
certain frequency distributions. For concrete modelling applications, the usefulness
of this depends on what meaning is assigned to the customers and their tables. The
role of the base distribution $P_0$ of the PYP is to provide that meaning. In the
restaurant metaphor, the first customer $x_i$ to be seated at the table with index $k$ triggers the
sampling of a dish $\ell_k$ from the base distribution, $\ell_k \sim P_0$. All subsequent customers
to join a non-empty table $k$ share in the same dish $\ell_k$ previously chosen for that table.

If we define the identity $y_i$ of a customer $x_i$ at table $z_i$ to be the table label $\ell_{z_i}$,
the effect is to produce a sequence $\{y_i\}$ distributed as follows:

$$y_i \mid G \sim G$$

$$G \mid a, b, P_0 \sim \mathcal{PY}(a, b, P_0)$$

The conditional probability of the identity $y_{n+1}$ for an additional customer $x_{n+1}$,
given labels $\boldsymbol{\ell}$ for the $n$ customers seated at $m$ tables according to the seating
arrangement $\mathbf{z}_n$, is

$$\begin{aligned}
P(y_{n+1} = y \mid \mathbf{z}_n, \boldsymbol{\ell}, a, b)
&= \sum_{k=1}^{m} \delta_y(\ell_k) \, \frac{n_k - a}{n + b} + \frac{am + b}{n + b} \, P_0(y) \\
&= \frac{N_y - a m_y + (am + b) \, P_0(y)}{n + b},
\end{aligned} \tag{2.15}$$

where $\delta$ is the Kronecker delta indicator function, $N_y$ is the number of customers
seated at tables labelled $\ell_k = y$, and $m_y$ is the number of such tables.
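Equation 2.15 can be evaluated directly from the seating statistics. The sketch below
assumes a uniform base distribution over a small closed vocabulary; the vocabulary,
table labels and table sizes are hypothetical.

```python
def pyp_predictive(y, table_labels, table_sizes, a, b, base_prob):
    """P(y_{n+1} = y | seating, labels, a, b) as in Equation 2.15."""
    n = sum(table_sizes)
    m = len(table_sizes)
    # N_y: customers at tables labelled y; m_y: number of such tables.
    n_y = sum(size for label, size in zip(table_labels, table_sizes) if label == y)
    m_y = sum(1 for label in table_labels if label == y)
    return (n_y - a * m_y + (a * m + b) * base_prob(y)) / (n + b)


vocabulary = ["blue", "clouds", "grey", "skies", "sky"]


def uniform_base(y):
    """Uniform base distribution over the closed toy vocabulary."""
    return 1.0 / len(vocabulary)


# Hypothetical restaurant state: two tables share the label "sky".
table_labels = ["sky", "blue", "sky", "clouds"]
table_sizes = [4, 2, 1, 1]
predictive = {y: pyp_predictive(y, table_labels, table_sizes, 0.5, 1.0, uniform_base)
              for y in vocabulary}
```

Labels that have not been served at any table, here grey and skies, still receive
probability mass through the base distribution term.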

As an example,⁷ the PYP can be used to define a simple unigram language model
over a closed vocabulary $V$. Let the base distribution $P_0 = \text{Uniform}(|V|)$ and regard
a customer $x$ as a generic word token. A label $\ell$ drawn from the base distribution and
assigned to the token via the table it joins in the associated restaurant then supplies
the token its identity as a particular word type from the vocabulary $V$. Note that the
same label $\ell$ can be drawn repeatedly from $P_0$. The term $N_y - a m_y$ in the numerator
of Equation (2.15) can be understood to discount the frequency $N_y$ with which
a word $y$ occurs in a corpus by an amount $a m_y$, while the second term provides
interpolation with the uniform base distribution, in analogy to the language model
smoothing described by Equation (2.5), p. 15. This smoothing ability forms the basis
of the language model (Teh, 2006a) that we extend in Chapter 3.

⁷ This example derives from the connections between language model smoothing and the PYP
pointed out by Goldwater et al. (2006) and Teh (2006a).

The base distribution can, however, also be used to generate more complex labels
than word identities. In particular, they can be used to generate trees of various kinds
(Cohn et al., 2010; Johnson et al., 2007a; Levenberg et al., 2012), a line of work we
extend to another grammar formalism in Chapter 5.


2.4 Summary 

This chapter described the two approaches to sequence-based statistical language
modelling that subsequent chapters extend by integrating information about sub-word
structures. In the second part of the chapter, we introduced the aspects of
Bayesian modelling with the non-parametric Pitman-Yor process priors that we employ
in the next chapter and the penultimate chapter of this dissertation.

Chapter 3

A Hierarchical Bayesian Language 
Model for Compounding 


Chapter Abstract 


In this chapter we introduce an n-gram-based LM that exploits the
sub-structure of single-token compound words to attain better smoothing.
The LM extends the Hierarchical Pitman-Yor language model (Teh,
2006a) with the ability to provide probabilities for out-of-vocabulary
compound words. This chapter is an extension of material originally
published as (Botha, 2012) and (Botha et al., 2012).


Compounding is a process that forms words by combining two or more other
words. For example, in English, the independent words space and ship may combine
to form the noun spaceship, while inspiring and awe form the hyphenated
adjective awe-inspiring. Many languages make use of compounding as an important
vehicle for novel word formation. Our focus on compounding is motivated by the
prevalence of closed compounds in certain languages. Such compound words are
written as single orthographic units, and can pose problems to LMs, and NLP systems
generally, that regard punctuation-delimited tokens as their elementary units
of analysis. Closed compounds are especially problematic in languages that exhibit
highly productive compounding, such as German, Dutch and Afrikaans. In these
instances, compounding can occur recursively, theoretically without bounds.
Word-based LMs are thus prone to suffer from sparse data effects that can be attributed
to compounds specifically. The aim of the model we propose is to overcome data
sparsity due to compounding by accounting for compound structure.


A compound is said to consist of a head component and one or more modifier
components, with optional linking elements between consecutive components
(Goldsmith and Reutter, 1998).

Examples of German compounds 

• A basic noun-noun compound: 

Auto + Unfall = Autounfall (‘car crash’) 

• Linking elements can appear between components 

Küche + Tisch = Küchentisch (‘kitchen table’)

• Components can undergo stemming 

Schule + Hof = Schulhof (‘schoolyard’) 

• Compounding is recursive 

(Regen + Schirm) + Hersteller
= Regenschirmhersteller (‘umbrella manufacturer’) 

• Compounding extends beyond noun components 

Zwei-Euro-Münze (‘two Euro coin’)

Fahrzeug (‘vehicle’) - from fahren (‘to drive’) + Zeug (‘stuff/thing’) 


3.1 Structural dependencies 


The linguistic intuition that we propose to exploit in our language model is that the
head of a compound word determines the morphosyntactic properties of the whole
word (Williams, 1981). We focus our investigation on German, which is predominantly
right-headed. The right-most component of a compound thus determines the
agreement requirements that need to be fulfilled between the compound itself and
other words in the sentence. For example, the Bahn (‘track/rail’) in the compound
Eisenbahn (‘railway’) identifies the word as singular feminine, which determines
the requirements for its agreement with verbs, articles and adjectives.

A language model could therefore give a reasonable assessment of the syntactic
fluency of a sequence of German words by abstracting away the non-head components
of compounds. For example, the sentence I’m going by train can be rendered
in German as either of the following:

• Ich fahre mit der Eisenbahn. 

• Ich fahre mit der Bahn. 

Collapsing all compounds to their heads and ignoring modifiers would decrease
sparsity and allow more robust n-gram probabilities to be estimated from data. But
such a strategy would not be probabilistically sound as a generative model of a corpus.
Moreover, a model that ignores modifiers would assign the same probability
value to Eisenbahn and the empirically much rarer Bobbahn (‘bobsled’), which
would be unsatisfactory in a task where the language model plays a discriminative
role. The model needs to account for the non-head components in some way. We
expect the identity and number of modifier components to be strongly correlated
with the identity of the head. In particular, the conditional distributions of modifier
given head will be sharply peaked. A simple approximation is thus to assume that,
conditioned on the head, modifiers are generated by a reverse n-gram model:

P(eisenbahn | mit der) = P(bahn | mit der) × P(eisen | bahn) × P($ | eisen)

The end symbol $ indicates the word boundary and doubles as a control on the
number of modifiers. In general, we will use this process as a back-off strategy,
i.e., when the trigram mit der Eisenbahn is unobserved. Note that this is markedly
different from linguistically naive back-off models that would score the unobserved
trigram mit der Eisenbahn by falling back on bigram or unigram estimates. In our
approach, we instead permit the model to back off to this decomposition before
dropping valuable context information.
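The decomposition can be written as a short function. This is a sketch of the
factorisation only, assuming that a compound has already been split into components
with the head right-most; the probability tables hold hypothetical values, not
estimates from data.

```python
def compound_prob(components, context, head_model, modifier_model, end="$"):
    """Score a right-headed compound given its context.

    The head is conditioned on the word context; each modifier is conditioned on
    the component to its right, and the end symbol closes the sequence.
    """
    head, modifiers = components[-1], components[:-1]
    prob = head_model[context][head]
    previous = head
    for modifier in reversed(modifiers):
        prob *= modifier_model[previous][modifier]
        previous = modifier
    return prob * modifier_model[previous][end]


# Hypothetical probability tables.
head_model = {("mit", "der"): {"bahn": 0.02}}
modifier_model = {
    "bahn": {"eisen": 0.3, "bob": 0.01, "$": 0.5},
    "eisen": {"$": 0.9},
    "bob": {"$": 0.9},
}

context = ("mit", "der")
p_eisenbahn = compound_prob(["eisen", "bahn"], context, head_model, modifier_model)
p_bobbahn = compound_prob(["bob", "bahn"], context, head_model, modifier_model)
p_bahn = compound_prob(["bahn"], context, head_model, modifier_model)
```

Both compounds share the head probability given the context, and the modifier model
alone separates Eisenbahn from the rarer Bobbahn.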
[Figure 3.1 image: mit der Draht-seil-bahn]

Figure 3.1: Intuition for the proposed generative process of a compound word: The context
generates the head component, which generates a modifier component, which in turn
generates another modifier. (Literal translation: “with the cable car”)


3.2 An n-gram model with productive compounding 


This section defines a hierarchical Bayesian model that embeds the aforementioned
intuitions about compound formation into an n-gram language model. The approach
is to extend the hierarchical Pitman-Yor language model (HPYLM; Teh, 2006a) by
modifying its base distributions to include a sub-model for generating a compound’s
modifiers conditional on its head.

A seemingly obvious alternative method of integration would be to use two
distinct word-level and compound-level n-gram models, independently built using
existing heuristic smoothing algorithms. However, the advantage of the unified
hierarchical Bayesian model proposed here is that it can learn an interpolation between
those levels, obviating the need to introduce and tune an extraneous interpolation
scheme between sub-models, while opening the door for future extensions,
e.g. analysing compounds occurring in the n-gram history.


3.2.1 Hierarchical Pitman-Yor Language Model (HPYLM) 

The HPYLM applies the technique of iteratively shortening the context of an n-gram
to effect smoothing over an n-gram conditional distribution, as used by conventional
n-gram LMs. Let $\mathbf{u}$ denote the context of an n-gram $w_1^n$, i.e. $\mathbf{u} = [w_1, \dots, w_{n-1}]$,
and define $\pi(\mathbf{u})$ as a function that truncates $\mathbf{u}$ by dropping the left-most context
word.

The HPYLM defines the distribution $G_{\mathbf{u}}$ of a word $w$ given its context $\mathbf{u}$ as being
drawn from a Pitman-Yor process with hyperparameters $a_{|\mathbf{u}|}$ and $b_{|\mathbf{u}|}$. The base
distribution is taken to be the distribution $G_{\pi(\mathbf{u})}$ governing the probability that the word
$w$ follows the truncated context $\pi(\mathbf{u})$, as in the case of back-off models. Since this
latter distribution $G_{\pi(\mathbf{u})}$ likewise needs to be modelled, we assume it is itself drawn
from another PYP, which has hyperparameters $a_{|\mathbf{u}|-1}$ and $b_{|\mathbf{u}|-1}$. This procedure is
applied recursively until reaching the unigram distribution $G_{\emptyset}$, for which a final PYP
prior is used, with its base distribution $G_0$ being the uniform distribution over the
vocabulary, $P(w) = 1/|\mathcal{W}|$.
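The recursion over truncated contexts can be sketched by applying the predictive rule of
Equation 2.15 at each level, with the predictive probability of the truncated context
serving as the base distribution. The sketch treats the seating statistics as given,
whereas in the actual model they are latent and must be inferred; all counts,
hyperparameter values and names below are hypothetical.

```python
def truncate(u):
    """pi(u): drop the left-most context word."""
    return u[1:]


def context_chain(u):
    """The sequence of contexts visited by repeated truncation."""
    chain = [tuple(u)]
    while chain[-1]:
        chain.append(truncate(chain[-1]))
    return chain


def hpylm_prob(w, u, stats, a, b, vocab_size):
    """Predictive probability of w after context u.

    stats maps a context tuple to {word: (customers, tables)}; a[k] and b[k]
    are the discount and strength shared by all contexts of length k.
    """
    if u is None:
        return 1.0 / vocab_size  # uniform base distribution G_0
    parent = truncate(u) if u else None
    base = hpylm_prob(w, parent, stats, a, b, vocab_size)
    counts = stats.get(u, {})
    n = sum(customers for customers, _ in counts.values())
    m = sum(tables for _, tables in counts.values())
    n_w, m_w = counts.get(w, (0, 0))
    return (n_w - a[len(u)] * m_w + (a[len(u)] * m + b[len(u)]) * base) / (n + b[len(u)])


vocab = ["bahn", "eisenbahn", "strasse", "der", "mit"]
stats = {
    ("mit", "der"): {"bahn": (3, 1), "eisenbahn": (1, 1)},
    ("der",): {"bahn": (2, 1), "eisenbahn": (1, 1), "strasse": (2, 2)},
    (): {"bahn": (1, 1), "eisenbahn": (1, 1), "strasse": (2, 1), "der": (3, 1)},
}
a = [0.5, 0.6, 0.7]
b = [1.0, 1.0, 1.0]
probs = {w: hpylm_prob(w, ("mit", "der"), stats, a, b, len(vocab)) for w in vocab}
```

A word never observed after mit der, such as strasse, still receives probability
through the shorter contexts, which is the smoothing effect described above.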

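Conditioned on the context $\mathbf{u}$, a word $w$ is generated as follows:

$$w \sim G_{\mathbf{u}} \tag{3.1}$$

$$G_{\mathbf{u}} \sim \mathcal{PY}(a_{|\mathbf{u}|}, b_{|\mathbf{u}|}, G_{\pi(\mathbf{u})}) \tag{3.2}$$

$$G_{\pi(\mathbf{u})} \sim \mathcal{PY}(a_{|\mathbf{u}|-1}, b_{|\mathbf{u}|-1}, G_{\pi(\pi(\mathbf{u}))}) \tag{3.3}$$

$$\vdots$$

$$G_{\emptyset} \sim \mathcal{PY}(a_0, b_0, G_0) \tag{3.4}$$

$$G_0 = \text{Uniform}(|\mathcal{W}|) \tag{3.5}$$


3.2.2 HPYLM + compounds (HPYLM+c)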


We define a compound word $w$ as a sequence of components $[c_1, \ldots, c_L]$, plus an
end symbol \$ marking either the left or the right boundary of the word, depending
on the direction of the model. To maintain generality over this choice of direction,
let $\Lambda$ be an index set over the positions, such that $c_{\Lambda_1}$ always
designates the head component.


Following the motivation in §3.1 we introduce an additional level of back-off
where the head component $c_{\Lambda_1}$ is conditioned on the context $\mathbf{u}$ of
the word, while the remaining components $c_{\Lambda_2}, \ldots, c_{\Lambda_L}$, as well
as an end-of-sequence symbol \$, are generated by a modifier model $F$, independently
of the context $\mathbf{u}$. We use a bigram HPYLM for $F$.

The two main modifications we make to the word-based HPYLM to obtain the
HPYLM+c are as follows:

1. Replace the support of the base distribution at the root of the hierarchy with
the reduced vocabulary $\mathcal{M}$, the set of unique elementary components $c$.
$\mathcal{M}$ also includes word types consisting of a single component to begin with.

2. Add an additional level of conditional distributions $H_{\mathbf{u}}$ (with
$|\mathbf{u}| = n - 1$) where items from $\mathcal{M}$ combine to form the observed
compound words.


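The process for generating a word from the HPYLM+c is as follows (see also
Figure 3.2b):

Draw head:       $c_{\Lambda_1} \sim G_{\mathbf{u}}$                                                        (3.6)
                 $G_{\mathbf{u}} \sim \mathcal{PY}(a_{|\mathbf{u}|}, b_{|\mathbf{u}|}, G_{\pi(\mathbf{u})})$              (3.7)
                 $G_{\pi(\mathbf{u})} \sim \mathcal{PY}(a_{|\mathbf{u}|-1}, b_{|\mathbf{u}|-1}, G_{\pi(\pi(\mathbf{u}))})$ (3.8)
                 $G_\emptyset \sim \mathcal{PY}(a_0, b_0, G_0)$                                                  (3.9)
                 $G_0 = \text{Uniform}(|\mathcal{M}|)$                                                           (3.10)

Draw modifiers:  $c_{\Lambda_i} \sim F_{c_{\Lambda_{i-1}}}$, $2 \le i \le L$ (i.e. until \$ is drawn)            (3.11)
                 $F_c \sim \mathcal{PY}(a'_1, b'_1, F_\emptyset)$                                                (3.12)
                 $F_\emptyset \sim \mathcal{PY}(a'_0, b'_0, F_0)$                                                (3.13)
                 $F_0 = \text{Uniform}(|\mathcal{M}|)$                                                           (3.14)

Create word:     $w = \text{concatenate}(c_1, \ldots, c_L)$                                                      (3.15)
                 $w \sim H_{\mathbf{u}}$                                                                        (3.16)
                 $H_{\mathbf{u}} \sim \mathcal{PY}(a''_{|\mathbf{u}|}, b''_{|\mathbf{u}|}, G_{\mathbf{u}} \times F)$     (3.17)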


So the base distribution for the prior of the word n-gram distribution H u is the 
product of a distribution G u over compound heads, given the same context u, and 
the modifier model F which conditions on the head component. 
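
The product structure of this base distribution can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: `head_prob` and `modifier_prob` are hypothetical stand-ins for the predictive probabilities of $G_{\mathbf{u}}$ and $F$, and the toy probability values are invented for the example rather than taken from any trained model.

```python
from typing import Callable, List, Sequence

END = "$"  # end-of-sequence symbol drawn from the modifier model


def head_first(components: Sequence[str], scheme: str = "ling") -> List[str]:
    """Order components so that position 0 is the head component."""
    if scheme == "ling":   # right-headed: index set [L, L-1, ..., 1]
        return list(reversed(components))
    if scheme == "inv":    # left-headed: index set [1, ..., L]
        return list(components)
    raise ValueError(f"unknown conditioning scheme: {scheme!r}")


def base_probability(
    components: Sequence[str],
    head_prob: Callable[[str], float],
    modifier_prob: Callable[[str, str], float],
    scheme: str = "ling",
) -> float:
    """G_u(head) x prod_i F(c_i | c_{i-1}) x F($ | last generated component)."""
    ordered = head_first(components, scheme)
    prob = head_prob(ordered[0])
    previous = ordered[0]
    for component in ordered[1:]:
        prob *= modifier_prob(component, previous)
        previous = component
    return prob * modifier_prob(END, previous)


# Toy probabilities, invented purely for illustration.
G = {"hof": 0.2, "schul": 0.05}
F = {("hof", "schul"): 0.5, ("schul", END): 0.9,
     ("schul", "hof"): 0.1, ("hof", END): 0.6}


def toy_modifier(component: str, previous: str) -> float:
    return F[(previous, component)]


p_ling = base_probability(["schul", "hof"], G.get, toy_modifier, "ling")
p_inv = base_probability(["schul", "hof"], G.get, toy_modifier, "inv")
```

Under the right-headed scheme the head hof is scored in context and schul is generated from it, whereas the left-headed scheme swaps these roles.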

For German, the linguistically motivated choice for conditioning in $F$ is
$\Lambda_{\text{ling}} = [L, L-1, \ldots, 1]$ such that $c_{\Lambda_1}$ is the true head
component; \$ is drawn from $F(\cdot \mid c_1)$ and marks the left word boundary.

In order to see if the correct linguistic intuition has any bearing on the model’s
extrinsic performance, we will also consider the reverse, supposing that the left-most
component were actually more important in this task, and letting the remaining
components be generated left-to-right. This is expressed by
$\Lambda_{\text{inv}} = [1, \ldots, L]$, where \$ this time marks the right word boundary
and is drawn from $F(\cdot \mid c_L)$.

(a) Trigram HPYLM generating word $w$.
(b) Trigram HPYLM+c generating word $w$ according to the conditioning scheme
$\Lambda_{\text{ling}}$, such that $c_L$ is the head and $c_1 \ldots c_{L-1}$ are the modifiers.

Figure 3.2: Plate diagrams illustrating the generative structure for two trigram models,
HPYLM (left) and HPYLM+c (right). Variables $w_{-2}$ and $w_{-1}$ denote the two words in
the trigram context. Hyperparameters and their priors are omitted for clarity.


3.2.3 Example 

This section provides an example of how the models are instantiated. We illustrate 
the trigram case and focus on the structures related to the following fragment of a 
hypothetical corpus: 

Observed n-grams                                              Segmentation
dem alten schulhof (‘the old schoolyard’)                     schul-hof
dem alten friedhof (‘the old cemetery’)                       fried-hof
des alten regierungschefs (‘the old head of government’)      regierungs-chefs


The hierarchies induced by these n-grams in both models can be visualised as
trees, as shown in Figure 3.3. The key thing to notice is how the HPYLM+c combines
the statistics of distinct compounds (schulhof and friedhof) sharing a head, as
indicated by the presence of two hof-customers in the restaurants of both
$G_{[\text{dem, alten}]}$ and $G_\emptyset$, unlike in the HPYLM.

(a) HPYLM (b) HPYLM+c

Figure 3.3: This visualisation illustrates the tree structures formed by the hierarchical
priors in the HPYLM and HPYLM+c models in representing the corpus fragment given in
§3.2.3. Grey elements represent some of the customers seated within the various
restaurants and each asterisk denotes a single customer. Only the minimal number of
customers relevant to the example are shown explicitly.


Under the HPYLM, the probability of the word schulhof given the context dem alten,
conditioned on a particular seating arrangement across the hierarchy, is computed
recursively using Equation 2.15 (p. 26), as follows:

$P_{\text{HPYLM}}(\text{schulhof} \mid \text{dem alten})$                                             (3.18)
$= G_{[\text{dem, alten}]}(\text{schulhof})$                                                          (3.19)
$= \frac{N^{(d.a.)}_{\text{schulhof}} - a_2 m^{(d.a.)}_{\text{schulhof}}}{n^{(d.a.)} + b_2}
   + \frac{a_2 m^{(d.a.)} + b_2}{n^{(d.a.)} + b_2} \, G_{[\text{alten}]}(\text{schulhof})$           (3.20)

with

$G_{[\text{alten}]}(\text{schulhof}) = \frac{N^{(a.)}_{\text{schulhof}} - a_1 m^{(a.)}_{\text{schulhof}}}{n^{(a.)} + b_1}
   + \frac{a_1 m^{(a.)} + b_1}{n^{(a.)} + b_1} \, G_\emptyset(\text{schulhof})$                       (3.21)
$G_\emptyset(\text{schulhof}) = \frac{N^{(\emptyset)}_{\text{schulhof}} - a_0 m^{(\emptyset)}_{\text{schulhof}}}{n^{(\emptyset)} + b_0}
   + \frac{a_0 m^{(\emptyset)} + b_0}{n^{(\emptyset)} + b_0} \, G_0(\text{schulhof})$                 (3.22)

where $G_0(\text{schulhof}) = 1/|\mathcal{W}|$. The notation here uses superscripts to
identify the associated restaurant in the hierarchy. For instance,
$N^{(d.a.)}_{\text{schulhof}}$ is the number of customers seated at tables labelled
schulhof in the restaurant for the context d.a. = [dem, alten], while
$m^{(a.)}_{\text{schulhof}}$ is the number of tables labelled schulhof in the restaurant
for the context a. = [alten].
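
The recursion in Equations 3.20-3.22 can be sketched in a few lines of code. This is a minimal illustration, assuming a fixed seating arrangement stored as per-word customer and table counts; the `Restaurant` class, the `predictive` function and the toy counts and hyperparameter values are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Sequence, Tuple

Context = Tuple[str, ...]


@dataclass
class Restaurant:
    customers: Dict[str, int] = field(default_factory=dict)  # N_w per word
    tables: Dict[str, int] = field(default_factory=dict)     # m_w per word


def predictive(word: str, context: Context,
               restaurants: Dict[Context, Restaurant],
               a: Sequence[float], b: Sequence[float],
               vocab_size: int) -> float:
    """P(word | context) for a fixed seating arrangement."""
    # Base probability: the uniform G_0 below the unigram level, otherwise
    # the restaurant of the truncated context (left-most word dropped).
    if context:
        base = predictive(word, context[1:], restaurants, a, b, vocab_size)
    else:
        base = 1.0 / vocab_size
    restaurant = restaurants.get(context)
    if restaurant is None or not restaurant.customers:
        return base  # an empty restaurant defers entirely to its base
    n = sum(restaurant.customers.values())   # all customers in the restaurant
    m = sum(restaurant.tables.values())      # all tables in the restaurant
    discount, strength = a[len(context)], b[len(context)]
    n_w = restaurant.customers.get(word, 0)
    m_w = restaurant.tables.get(word, 0)
    return ((n_w - discount * m_w) / (n + strength)
            + (discount * m + strength) / (n + strength) * base)


def one_each() -> Restaurant:
    # One customer at one table for each of the two observed words.
    return Restaurant({"schulhof": 1, "friedhof": 1},
                      {"schulhof": 1, "friedhof": 1})


restaurants = {("dem", "alten"): one_each(), ("alten",): one_each(), (): one_each()}
vocab = ["schulhof", "friedhof", "haus"]
a, b = [0.5, 0.5, 0.5], [1.0, 1.0, 1.0]   # illustrative hyperparameters
probs = {w: predictive(w, ("dem", "alten"), restaurants, a, b, len(vocab))
         for w in vocab}
```

Because each level returns the discounted counts plus the left-over mass times the base probability, the resulting values sum to one over the vocabulary, and the unseen word haus still receives non-zero probability.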

Under the HPYLM+c, the same n-gram probability is computed as follows, with 
the main difference being the product in the base distribution probability: 


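$P_{\text{HPYLM+c}}(\text{schulhof} \mid \text{dem alten})$                                           (3.23)
$= H_{[\text{dem, alten}]}(\text{schulhof})$                                                          (3.24)
$= \frac{N^{(d.a.)''}_{\text{schulhof}} - a''_2 m^{(d.a.)''}_{\text{schulhof}}}{n^{(d.a.)''} + b''_2}$   (3.25)
$\quad + \frac{a''_2 m^{(d.a.)''} + b''_2}{n^{(d.a.)''} + b''_2} \,
   G_{[\text{dem, alten}]}(\text{hof}) \times F_{[\text{hof}]}(\text{schul}) \times F_{[\text{schul}]}(\$)$   (3.26)
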

$G_{[\text{dem, alten}]}(\text{hof})$ is computed similarly as before, but now has the
argument hof and will bottom out with a term $G_0(\text{hof}) = 1/|\mathcal{M}|$. The
cascade of computations in the modifier model has a similar form, for instance:

$F_{[\text{hof}]}(\text{schul}) = \frac{N^{(h.)'}_{\text{schul}} - a'_1 m^{(h.)'}_{\text{schul}}}{n^{(h.)'} + b'_1}
   + \frac{a'_1 m^{(h.)'} + b'_1}{n^{(h.)'} + b'_1} \, F_\emptyset(\text{schul})$                     (3.27)
$F_\emptyset(\text{schul}) = \frac{N^{(\emptyset)'}_{\text{schul}} - a'_0 m^{(\emptyset)'}_{\text{schul}}}{n^{(\emptyset)'} + b'_0}
   + \frac{a'_0 m^{(\emptyset)'} + b'_0}{n^{(\emptyset)'} + b'_0} \, F_0(\text{schul})$               (3.28)

where h. = [hof] and $F_0(\text{schul}) = 1/|\mathcal{M}|$.


3.3 Inference


For ease of exposition we describe inference with reference to the trigram HPYLM+c
model, but the general case should be clear. The model is specified by the latent
variables $\mathcal{G} = (G_\emptyset, G_{[v]}, G_{[u,v]}, F_\emptyset, F_c)$, where
$u, v \in \mathcal{W}$, $c \in \mathcal{M}$, and hyperparameters
$\Omega = (a_i, b_i, a'_j, b'_j, a''_2, b''_2)$, where $i = 0, 1, 2$; $j = 0, 1$; single
primes designate the hyperparameters in $F$ and double primes those of $H_{[u,v]}$. By
marginalising out the latent variables in $\mathcal{G}$, we can use a collapsed Gibbs
sampler to do inference via the hierarchical variant of the Pitman-Yor Chinese
Restaurant Process (PYCRP) (Teh, 2006b).


When the prior of $G_{\mathbf{u}}$ has a base distribution that is itself
PYP-distributed, as in the HPYLM, the restaurant metaphor changes slightly from the
description in chapter 2. In general, each context $\mathbf{u}$ has an associated
restaurant. Whenever a customer is seated at a previously unoccupied table in some
restaurant $R$ which has a PYP-distributed base distribution, a dish must be sampled
from that base distribution. That is, a further customer is spawned and joins the
parent restaurant pa($R$) associated with the context $\pi(\mathbf{u})$. This induces a
consistency constraint over the hierarchy: the number of tables $m_\ell$ labelled with
dish $\ell$ in restaurant $R$ must equal the number of customers $N_\ell$ at tables
serving dish $\ell$ in the parent restaurant pa($R$).

We take care to satisfy this constraint in the HPYLM+c model, where a restaurant
associated with the distribution $H_{\mathbf{u}}$ has as base distribution a product of
PYP-distributed distributions, $F$ and $G_{\mathbf{u}}$. Here, when a dish $\ell$ is drawn
for a previously unoccupied table in trigram restaurant $H_{[u,v]}$, a customer
$c_{\Lambda_1}$ joins the corresponding trigram restaurant $G_{[u,v]}$, and customers
$c_{\Lambda_2}, \ldots, c_{\Lambda_L}$, \$ are sequentially sent to the restaurants
associated with $F_{c_{\Lambda_1}}, \ldots, F_{c_{\Lambda_L}}$, respectively.


3.3.1 Posterior inference 


The Chinese restaurant representation effectively allows us to replace the collection
of Pitman-Yor priors with a collection of seating arrangements, $S$. Marginalising
over the seating arrangements $S$ and parameters $\Omega$ yields the posterior predictive
probability of a word $w$ given the training data $\mathcal{D}$:

    $p(w \mid \mathcal{D}) = \int_{S,\Omega} p(w \mid S, \Omega) \, p(S, \Omega \mid \mathcal{D}) \, dS \, d\Omega$.   (3.29)

This integral can be approximated by averaging over a number of posterior samples
$(S, \Omega)$ generated using Markov chain Monte Carlo methods. In particular, we
can use a Gibbs sampler that explores the space of possible seating assignments by
iteratively resampling a single model variable conditioned on all other model
variables. Given a current seating arrangement $S$, a new configuration $S'$ is sampled
by individually reseating each customer $x_i$ in the trigram restaurants $H_{[u,v]}$.
Reseating a customer $x_i$ means removing them from their current table $z_i$ and
assigning another table $k'$ drawn from the conditional distribution
$P(z_i \mid \mathbf{z}^{-i})$. Here, $\mathbf{z}^{-i}$ denotes the table allocations
$(z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_{|\mathbf{z}|})$ of the other customers in the
given restaurant.

If the original table $z_i$ only seated a single customer, the removal of that
customer triggers further customer removals from the base distribution restaurants in
accordance with the hierarchy consistency constraints.

In the absence of any strong intuitions about appropriate values for the
hyperparameters, we place vague priors over them and use slice sampling (Neal, 2003) to
update their values during generation of the posterior samples:
$a \sim \text{Beta}(1, 1)$ and $b \sim \text{Gamma}(10, 0.1)$.¹

Lastly, we follow the pragmatic course of using a single sample $(S, \Omega)$ to
approximate the integral over the posterior.²

3.3.2 Sampler description 

The previous section outlines the Gibbs sampler in broad terms. In this section, we 
detail the algorithms involved. 

1 We used the slice sampler implementation released by Mark Johnson,
http://web.science.mq.edu.au/~mjohnson/Software.htm

2 Preliminary experiments (C. Dyer, p.c.) indicated that the posterior over the latent
model structure is quite sharply peaked, so that a single sample constitutes a
low-variance estimator of the posterior predictive distribution.

The scheme described above poses the challenge of keeping track of a table
assignment for each individual customer in every restaurant. Given that language
models need to be trained on substantial amounts of data, this bookkeeping can be
problematic, both in terms of memory and implementation.


Blunsom et al. (2009) introduced an efficient solution to this problem, which
leverages the exchangeability property of the hierarchical Dirichlet process and which
carries over straightforwardly to the PYP. Instead of tracking what table each
customer is seated at, the authors suggest maintaining histograms of table occupancy.
When a customer has to be added or removed from a restaurant, it is done by
sampling directly from the histogram. A histogram $\text{Hist}_w$ maps a table occupancy
value $t$ to the number of tables labelled with $w$ and having $t$ customers,
$\text{Hist}_w[t]$.

As a brief example, suppose we have an assignment of 6 customers to tables
as $z_1 = 1$, $z_2 = 1$, $z_3 = 2$, $z_4 = 1$, $z_5 = 3$, $z_6 = 4$ and the tables are
labelled as $\ell_1 = \ell_3 = \text{sun}$, $\ell_2 = \ell_4 = \text{moon}$. The equivalent
histogram representation is $\text{Hist}_{\text{sun}}[3] = 1$,
$\text{Hist}_{\text{sun}}[1] = 1$ and $\text{Hist}_{\text{moon}}[1] = 2$.
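
The histogram bookkeeping can be illustrated with a short sketch for a single restaurant. It is not the implementation used in this work: the function names, the use of `collections.Counter`, and the convention that the caller passes in the unnormalised new-table weight `p_new` are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def build_histograms(assignments: Sequence[Hashable],
                     labels: Dict[Hashable, str]) -> Dict[str, Counter]:
    """Collapse per-customer table assignments into Hist_w[t], the number of
    tables labelled w that seat exactly t customers."""
    occupancy = Counter(assignments)  # table id -> number of customers
    histograms: Dict[str, Counter] = defaultdict(Counter)
    for table, t in occupancy.items():
        histograms[labels[table]][t] += 1
    return histograms


def num_customers(hist: Counter) -> int:
    """N_w: customers of this type, recovered from the histogram."""
    return sum(t * count for t, count in hist.items())


def num_tables(hist: Counter) -> int:
    """m_w: tables of this type, recovered from the histogram."""
    return sum(hist.values())


def _move_table(hist: Counter, old: int, new: int) -> None:
    """Change the occupancy of one table from `old` to `new` customers."""
    hist[old] -= 1
    if hist[old] == 0:
        del hist[old]
    if new > 0:
        hist[new] += 1


def add_customer(hist: Counter, a: float, p_new: float,
                 rng: random.Random) -> bool:
    """Seat one more customer of this type; return True if a table opened.
    `p_new` is the unnormalised new-table weight supplied by the caller."""
    p_share = max(num_customers(hist) - a * num_tables(hist), 0.0)
    if p_share == 0.0 or rng.uniform(0.0, p_share + p_new) < p_new:
        hist[1] += 1
        return True
    r = rng.uniform(0.0, p_share)
    sizes = sorted(hist)
    for t in sizes:
        r -= hist[t] * (t - a)
        if r < 0 or t == sizes[-1]:  # last size doubles as rounding fallback
            _move_table(hist, t, t + 1)
            break
    return False


def remove_customer(hist: Counter, rng: random.Random) -> bool:
    """Remove one customer of this type; return True if a table emptied."""
    r = rng.uniform(0.0, num_customers(hist))
    sizes = sorted(hist)
    for t in sizes:
        r -= hist[t] * t
        if r < 0 or t == sizes[-1]:
            _move_table(hist, t, t - 1)
            return t == 1
    raise ValueError("no customers of this type to remove")


# The six-customer example from the text.
hist = build_histograms([1, 1, 2, 1, 3, 4],
                        {1: "sun", 2: "moon", 3: "sun", 4: "moon"})
```

Only the occupancy counts are stored, so memory grows with the number of distinct table sizes per word type rather than with the number of customers.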

The two basic operations required for updating the seating arrangement in the
HPYLM+c are AddWordToModel and RemoveWordFromModel, as defined in
Algorithms 1-2. To facilitate reading the algorithms in conjunction with the
mathematical definitions given before, the pseudocode retains the quantities $N_w$,
$m_w$, and $m$, but we note that these can all be reconstructed from the histogram
representations as follows:

    $N_w = \sum_{t \in \text{Hist}_w} t \times \text{Hist}_w[t]$      (3.30)
    $m_w = \sum_{t \in \text{Hist}_w} \text{Hist}_w[t]$               (3.31)
    and $m = \sum_w m_w$, as stated earlier.                          (3.32)

Algorithm 1 Adding an n-gram (u, w) to the HPYLM+c.

function AddWordToModel(w, u)
    c ← SegmentCompound(w)                     ▷ for example, as detailed in §3.4.2
    P_0 ← G_u(c_{Λ_1}) × ∏_{i=2}^{L} F_{c_{Λ_{i−1}}}(c_{Λ_i}) × F_{c_{Λ_L}}($)
                                               ▷ base probability, cf. Equation 3.17
    if AddCustomerToRestaurant(w, P_0, H_u, a''_{|u|}, b''_{|u|}) then
        RecursiveAdd(c_{Λ_1}, u, G_*, a, b)
        for i ← 2, L do
            RecursiveAdd(c_{Λ_i}, c_{Λ_{i−1}}, F_*, a', b')
        RecursiveAdd($, c_{Λ_L}, F_*, a', b')
end function

function AddCustomerToRestaurant(w, P_0, X, a, b)
    ▷ Hist_w is the table occupancy histogram for customer type w for the restaurant X.
    isNewTable ← false
    p_share ← max(N_w − a × m_w, 0)            ▷ based on Equation 2.10
    p_new ← (a × m + b) × P_0
    r ← random(0, p_share + p_new)
    if r < p_new then
        Hist_w[1] ← Hist_w[1] + 1
        isNewTable ← true
    else
        r ← random(0, p_share)
        for all t ∈ Hist_w do                  ▷ sample a table occupancy from histogram
            r ← r − (Hist_w[t] × (t − a))
            if r < 0 then                      ▷ add to a table currently containing t customers
                Hist_w[t + 1] ← Hist_w[t + 1] + 1
                Hist_w[t] ← Hist_w[t] − 1
                break
    N_w ← N_w + 1
    return isNewTable
end function

function RecursiveAdd(w, u, X_*, a, b)
    if u = ∅ then return
    P_0 ← X_{π(u)}(w)
    if AddCustomerToRestaurant(w, P_0, X_u, a(|u|), b(|u|)) then
        RecursiveAdd(w, π(u), X_*, a, b)
end function

Algorithm 2 Removing an n-gram (u, w) from the HPYLM+c.

function RemoveWordFromModel(w, u)
    c ← segment(w)
    if RemoveCustomerFromRestaurant(w, H_u) then
        RecursiveRemove(c_{Λ_1}, u, G_*)
        for i ← 2, L do
            RecursiveRemove(c_{Λ_i}, c_{Λ_{i−1}}, F_*)
        RecursiveRemove($, c_{Λ_L}, F_*)
end function

function RemoveCustomerFromRestaurant(w, X)
    ▷ Hist_w is the table occupancy histogram for customer type w for the restaurant X.
    isTableEmptied ← false
    r ← random(0, N_w)
    for all t ∈ Hist_w do                      ▷ sample a table occupancy from the histogram
        r ← r − Hist_w[t] × t
        if r < 0 then                          ▷ remove from a table currently containing t customers
            Hist_w[t] ← Hist_w[t] − 1
            if t > 1 then
                Hist_w[t − 1] ← Hist_w[t − 1] + 1
            else
                isTableEmptied ← true
            break
    N_w ← N_w − 1
    return isTableEmptied
end function

function RecursiveRemove(w, u, X_*)
    if u = ∅ then return
    if RemoveCustomerFromRestaurant(w, X_u) then
        RecursiveRemove(w, π(u), X_*)
end function

(a) Statistics of training corpora.

                 En       De       De LM
Sentences        1.7m     1.7m     2.4m
Word Tokens      49m      38m      59m
Word Types       112k     351k

(b) Compound types by length.

Parts/Compound   # Compound Types
2                197233
3                25128
4                1194
5                59
Total number:    223614

Table 3.1: Summary of training data and compound segmentation. The combined reading is
that 223614 of the 351k word types are designated as compounds.

3.4 Experiments

We evaluate the proposed compound model HPYLM+c on German in terms of its
intrinsic performance (§3.4.4), and by its influence when used as part of a machine
translation system (§3.4.5). In both settings, we compare against baselines obtained
with the standard HPYLM as well as the Modified Kneser-Ney (MKN) LM.

The next two sections describe further methodology details. 


3.4.1 Data preparation 


We use corpus data released as part of the 2011 ACL Workshop on Machine
Translation’s shared task.³ Standard data preprocessing steps included normalising
punctuation, tokenising and lowercasing all words.

For language model training, we used the combination of the corpora Europarl,
news-commentary and news-2011, filtered to exclude duplicate sentences. This
amounts to 59 million word tokens.

The language model vocabulary $\mathcal{W}$ consists of all word types appearing in the
target-side of the bitext. As the machine translation system used in the subsequent
extrinsic evaluation cannot output word types outside of this vocabulary, the choice
was made not to model anything beyond that.⁴ LM training tokens outside of that
vocabulary were replaced by the Unk symbol. Models were trained using all resulting
4-grams; i.e. no pruning was done on low-frequency n-grams.

3 http://www.statmt.org/wmt11/

4 See §3.4.5.1 for details on bitext preprocessing.

Test data for language model evaluation comprised the combination of the news-test
corpora of 2008-2010, released with WMT11. Statistics are provided in Table 3.1a.

3.4.2 Compound segmentation method 

For this evaluation, we used a deterministic, a priori segmentation of compound
words into their parts. This means we assume a single, fixed analysis of a compound
regardless of the context it occurs in, which is necessitated by the fact that our
probabilistic model does not specify a step for choosing an analysis. To construct a
segmentation dictionary, we used the one-best segmentation output of a supervised
maximum-entropy-based compound splitter (Dyer, 2009) on all the words in the
training vocabulary. In addition, word-internal hyphens were also taken as
segmentation points.⁵ We regard the set of compounds
$\mathcal{C} \subset \mathcal{W}$ as those word types that are split into two or more
parts by this procedure. Similarly, $\mathcal{M}_{\mathcal{C}} \subset \mathcal{M}$ is the
set of unique word parts that result from segmenting the compounds $\mathcal{C}$. The
majority of compounds thus identified consist of two or three parts (Table 3.1b).

In our method, linking elements do not constitute items in the vocabulary of word
components $\mathcal{M}$. Regarding linking elements as components in their own right
would sacrifice important contextual information and disrupt the conditional bigram
distributions, $F(\cdot \mid c_{\Lambda_{i-1}})$, generating compound modifiers. That is,
faced with the compound küche-n-tisch (‘kitchen table’), we want the generation of the
modifier to be according to P(küche|tisch), and not P(küche|n). To account for linking
elements we follow the pragmatic option of merging any linking element onto its
adjacent component: for $\Lambda_{\text{ling}}$, merging happens onto the preceding
component (i.e. using P(küchen|tisch)), while for $\Lambda_{\text{inv}}$, the linking
element is merged onto the succeeding component (i.e. using P(ntisch|küche)). This
ensures that the set of head components $\{c_{\Lambda_1}\}$ is not fragmented.⁶
Table 3.2 summarises the effect of these different merging options on the size of the
vocabulary $\mathcal{M}$ of word components modelled.

5 Our choice of a supervised compound splitter is based on a desire to focus the
evaluation on the language model given high quality segmentations. The splitter of
Dyer (2009) obtained a segmentation F-score of ~95% in their evaluation. Unsupervised
methods could also be used with our model.

Linking element handling                        |M_C|     |M|       Segmented example
(no segmentation, i.e. $\mathcal{M} = \mathcal{W}$)   223614    350997    Küchentisch
delete                                          45183     144358    Küche-tisch
merge left ($\Lambda_{\text{ling}}$)             50868     152323    Küchen-tisch
merge right ($\Lambda_{\text{inv}}$)             63315     164095    Küche-ntisch

Table 3.2: Effect of linking elements on the number of unique word components modelled.
The second column is the number of components measured only over the set of compounds
$\mathcal{C}$, while the third column is over all words $\mathcal{W}$, showing what the
size of $\mathcal{M}$, the support of the base distribution $G_0$, would be under the
alternative schemes for handling linking elements.
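
The three options for handling linking elements can be sketched as follows. This is an illustrative helper only: the function name `merge_linking_elements` and the input format (surface pieces in left-to-right order, with a parallel list of flags marking which pieces are linking elements) are assumptions for the example and do not describe the output format of the compound splitter.

```python
from typing import List, Sequence


def merge_linking_elements(pieces: Sequence[str],
                           is_linker: Sequence[bool],
                           mode: str) -> List[str]:
    """Return word components after handling linking elements.

    mode = "left":   merge each linker onto the preceding component
    mode = "right":  merge each linker onto the succeeding component
    mode = "delete": drop linkers altogether
    """
    if mode not in ("left", "right", "delete"):
        raise ValueError(f"unknown mode: {mode!r}")
    merged: List[str] = []
    pending = ""  # linker text waiting to be prefixed to the next component
    for piece, linker in zip(pieces, is_linker):
        if not linker:
            merged.append(pending + piece)
            pending = ""
        elif mode == "left":
            if not merged:
                raise ValueError("linker has no preceding component")
            merged[-1] += piece
        elif mode == "right":
            pending += piece
        # mode == "delete": the linker is simply skipped
    if pending:
        raise ValueError("linker has no succeeding component")
    return merged


pieces, flags = ["küche", "n", "tisch"], [False, True, False]
left = merge_linking_elements(pieces, flags, "left")
right = merge_linking_elements(pieces, flags, "right")
deleted = merge_linking_elements(pieces, flags, "delete")
```

The three results correspond to the segmented examples in the last three rows of Table 3.2.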


3.4.3 MCMC sampler configuration 


We initialised the samplers used for the HPYLM and HPYLM+c with random seating
assignments instead of basing the initial configuration on the actual conditional
distributions for adding customers to restaurants. This worked satisfactorily and
was convenient in our implementation, so we did not experiment with alternative
initialisation schemes.

Each sampler was then run for 300 iterations of burn-in, and we gauged convergence
toward the underlying stationary distribution by means of the posterior likelihood
(Figure 3.4). This approach takes a cue from earlier work on the HPYLM on non-trivial
amounts of training data, where a similarly low number of iterations was used
(Huang and Renals, 2010). One consideration for why this works reasonably well, in
light of the favourable comparisons of the HPYLMs to traditional back-off LMs, is that
the model does not need to infer complex latent structure, as may be the case in other
MCMC applications where vastly more iterations are standard practice. The identities
of word tokens are fully observed, so that the sampler primarily has to explore
different seating arrangements across the hierarchy of Chinese restaurant processes.
Moreover, the training data tends to be sparse, so that the range of possible
configurations for the restaurants corresponding to rare n-grams is limited by a
paucity of customers.

6 It is worth noting that for German the presence and identity of linking elements
between consecutive components $c_i$ and $c_{i+1}$ are largely governed by the first
component $c_i$ (Goldsmith and Reutter, 1998). A more sophisticated model could
incorporate this bias.

Figure 3.4: Sampler burn-in for the 4-gram models.


3.4.4 Intrinsic evaluation 


The modelling approach proposed in this chapter is premised on the hypothesis that
better smoothing of n-gram distributions can be obtained by analysing compounds
into their components. This intrinsic evaluation aims to test that hypothesis and to
explore the relative performance of the model variants that make different
assumptions around the conditional dependencies of compound parts.

The HPYLM+c models were compared against two baseline methods, one using
MKN smoothing and the other being the HPYLM, both without any special handling
of compounds.⁷

3.4.4.1 Results 


Test-set perplexities are summarised in Table 3.3. We find that the HPYLM obtains
a perplexity of 294.0 on our test-set, lower than the MKN perplexity of 299.9. This
outcome is consistent with earlier results that the HPYLM outperforms the MKN
model for training data sets of non-trivial sizes (Huang and Renals, 2009).

7 The SRILM toolkit was used for training and testing the MKN models (Stolcke, 2002).

Model            PPL      vs. MKN    vs. HPYLM
MKN              299.9
HPYLM            294.0    -2.0%
HPYLM+c ling     294.1    -1.9%      +0.0%
HPYLM+c inv      305.5    +1.9%      +3.9%

Table 3.3: Test-set perplexity results for two variants of the proposed compounding
language model and two baseline language models. The last two columns give the
percentage perplexity reduction obtained by the model of the given row relative to each
baseline.
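
As a worked check of the last two columns of Table 3.3, the relative differences can be recomputed from the reported perplexities. The helper name `relative_change` is introduced here for illustration; negative values denote a reduction in perplexity with respect to the baseline.

```python
def relative_change(model_ppl: float, baseline_ppl: float) -> float:
    """Percentage change in perplexity relative to a baseline."""
    return 100.0 * (model_ppl - baseline_ppl) / baseline_ppl


# Perplexities as reported in Table 3.3.
perplexity = {
    "MKN": 299.9,
    "HPYLM": 294.0,
    "HPYLM+c ling": 294.1,
    "HPYLM+c inv": 305.5,
}

vs_mkn = {model: round(relative_change(ppl, perplexity["MKN"]), 1)
          for model, ppl in perplexity.items()}
vs_hpylm = {model: round(relative_change(ppl, perplexity["HPYLM"]), 1)
            for model, ppl in perplexity.items()}
```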

The HPYLM+c that assumes right-headed compounds ($\Lambda_{\text{ling}}$) virtually
matches the performance of the HPYLM baseline, a result that we consider in more detail
shortly. The assumption of left-headed compounds ($\Lambda_{\text{inv}}$), which is
inconsistent with the known properties of German, leads to a marked degradation in
perplexity. The HPYLM+c with this assumption achieves perplexity of 305.5, which is
higher than the perplexity of both baseline models. This result confirms the validity
of our strategy to condition actual compound heads on sentential context.

The aforementioned comparison features a bias toward the word-based baseline
models, since the perplexity of different models on a given data set is directly
comparable only if the models are defined over the same vocabulary. This is not the
case here: according to the generative process of the HPYLM+c (§3.2.2), the generation
of an additional modifier component $c \in \mathcal{M}$ for any partially formed token
$w$ always carries non-zero probability. HPYLM+c thereby defines a distribution over
a countably infinite set of word types, compared to the finite vocabularies modelled
by the word-based HPYLM and MKN. The preceding results therefore establish a
type of upper bound on the perplexities of the HPYLM+c models on the test-set.

To obtain a fairer comparison, we performed the costly renormalisation of the
right-headed HPYLM+c over the finite vocabulary of words modelled by the baseline
LMs. We find that the test-set perplexity of the renormalised model is 292.2,
which is below the 294.0 of the HPYLM baseline. The fact that the difference in
perplexity between the infinite model and its renormalisation is not too great suggests
that the HPYLM+c reserves a reasonable amount of probability mass for words outside
the finite vocabulary; it would not be ideal if too large a portion of mass was
allocated beyond that vocabulary, since “most” items in the countably infinite domain
of the model would not constitute well-formed compounds that could practically
occur in the language.
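
The effect of renormalisation can be illustrated on a toy distribution. The sketch below assumes a single fixed context and an invented open-vocabulary distribution in which 20% of the mass falls outside the finite vocabulary; in the actual experiment the normalisation has to be repeated for every test context, which is what makes it costly. The function names are hypothetical.

```python
import math
from typing import Callable, Dict, Iterable, Sequence


def renormalise(prob: Callable[[str], float],
                vocab: Iterable[str]) -> Dict[str, float]:
    """Rescale probabilities so that they sum to one over a finite vocabulary."""
    words = list(vocab)
    mass = sum(prob(w) for w in words)
    return {w: prob(w) / mass for w in words}


def perplexity(token_probs: Sequence[float]) -> float:
    """Exponentiated average negative log-probability of the test tokens."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


# Invented open-vocabulary model for one context: "ab" and "abb" stand for
# compound types that lie outside the finite vocabulary.
open_model = {"a": 0.5, "b": 0.3, "ab": 0.15, "abb": 0.05}
vocab = ["a", "b"]
closed = renormalise(lambda w: open_model.get(w, 0.0), vocab)

test_tokens = ["a", "b", "a"]
ppl_open = perplexity([open_model[w] for w in test_tokens])
ppl_closed = perplexity([closed[w] for w in test_tokens])
```

Since renormalisation can only increase the probability of in-vocabulary words, the perplexity of the renormalised model is never higher than that of the open-vocabulary model on the same tokens.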

3.4.4.2 Effect of Markov order 

The baseline word-based language models deal with unseen test n-grams by
progressively truncating the context to rely on lower-order distributions for which the
training data is less sparse. The simplifying assumption encoded by this strategy
is that novelty in n-grams arises strictly from the high likelihood of encountering a
new element from the space of $|\mathcal{W}|^n$ possible n-grams.

In the HPYLM+c, the back-off procedure for handling unseen n-grams is first
to decompose the predicted compound into its components, and thereafter to truncate
context. The underlying assumption is therefore that compounding is the primary
driver of novelty in n-grams, and that the space of possible n-grams should be
viewed as $|\mathcal{W}|^{n-1} \times |\mathcal{M}|^L$, where the number of components in
a compound, $L$, is in principle unbounded but in practice likely to be bounded at the
order of 10.

The aforementioned assumption in the HPYLM+c is violated when compounding
is not responsible for an unseen n-gram. The model would then prematurely decompose
the predicted word when in fact immediately truncating the context without
decomposing the predicted word would result in matching some observed k-gram
($1 < k < n$).

The impact of this effect is measured in a further experiment by training language
models with lower maximum orders, n = 2 and n = 3. At lower n, there should be
fewer test tokens for which premature decomposition could occur, in principle. The
results in Table 3.4 show that the relative margin of improvement by HPYLM+c over
HPYLM increases as n is decreased from 4 to 2. This outcome implies that prioritising
generalisation through compound decomposition higher than generalisation
through context truncation tends to constrain the model’s performance.

                               n=2      n=3      n=4
MKN                            394.5    307.2    299.9
HPYLM                          396.6    303.3    294.0
HPYLM+c                        390.0    299.3    294.1
HPYLM relative to MKN          +0.5%    -2.6%    -1.9%
HPYLM+c relative to HPYLM      -1.6%    -1.3%    +0.0%

Table 3.4: Test-set perplexity for models trained with different n-gram orders, along
with relative differences at the bottom. HPYLM+c refers to the variant using
$\Lambda_{\text{ling}}$.

Convergence of lower-order models. For n = 2 and n = 3, the sampler had not
fully converged after 300 iterations. We suspect this is due to the higher entropy in
the distributions governing the seating assignments: if n = 2, there should be more
customers (hence more possible seating configurations) in the average restaurant for 
a particular context-length j < n than in the same restaurant if n is larger. This 
did not affect perplexity, which was stable when evaluating with different individual 
samples from the posterior around 300 iterations. 

3.4.4.3 Breakdown by different test subsets 

A more fine-grained view of model performance can be obtained by partitioning test
tokens into meaningful subsets and computing the perplexity over the test tokens in
each subset. In this section, the subsets are characterised by the type of token
(compound vs. non-compound) and the “hit-length” h, where h = 1 means the token
appears in an unseen context and the language model has to rely on the unigram
distribution to score it, through to h = n, for which the full n-gram occurred in
training data. Note that h is defined from the view-point of a standard n-gram model,
using surface-level matching, and does not include decomposition.
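
The hit-length of a test token can be computed with a simple longest-suffix lookup, sketched below. The helper names are hypothetical, sentence padding is ignored for brevity, and returning 0 for a word never seen in training is a convention of this sketch (in the experiments such tokens are mapped to the Unk symbol instead).

```python
from typing import Iterable, Sequence, Set, Tuple


def collect_ngrams(sentences: Iterable[Sequence[str]],
                   n: int) -> Set[Tuple[str, ...]]:
    """All k-grams (1 <= k <= n) observed in the training sentences."""
    seen: Set[Tuple[str, ...]] = set()
    for sentence in sentences:
        for end in range(1, len(sentence) + 1):
            for k in range(1, min(n, end) + 1):
                seen.add(tuple(sentence[end - k:end]))
    return seen


def hit_length(ngram: Sequence[str], seen: Set[Tuple[str, ...]]) -> int:
    """Largest h such that the last h words of `ngram` occurred in training,
    using surface-level matching only (no compound decomposition)."""
    for h in range(len(ngram), 0, -1):
        if tuple(ngram[-h:]) in seen:
            return h
    return 0


seen = collect_ngrams([["dem", "alten", "schulhof"]], n=3)
full_hit = hit_length(("dem", "alten", "schulhof"), seen)
bigram_hit = hit_length(("des", "alten", "schulhof"), seen)
unigram_hit = hit_length(("des", "neuen", "schulhof"), seen)
no_hit = hit_length(("dem", "alten", "friedhof"), seen)
```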

The results of this breakdown are reported in terms of relative cross-entropy
reductions in Figure 3.5 and the sizes of the different subsets are summarised in
Figure 3.6.⁸

[Figure 3.5 panels: All, Non-compounds, Compounds]

Figure 3.5: Percentage relative cross-entropy reduction of HPYLM+c with respect to the
baseline HPYLM, broken down by test token hit-length h and compound status (defined in
§3.4.4.3). Both are 4-gram models. Bars going down imply improvement over the baseline.


The breakdown indicates that HPYLM+c improves over the baseline HPYLM 
only for test tokens appearing in fully unseen contexts (h = 1); this improvement is 
larger for compounds (-4.6%) than for non-compounds (-0.9%). This improvement 
for non-compounds with h = 1 is consistent with the fact that many compound 
heads appear as words in themselves; their prediction thus benefits from the way the 
HPYLM+c shares statistical strength across instantiations of a given head. 

The categories where the HPYLM+c performs worse than the baseline are for
trigram hits generally, and for compounds at hit-lengths $h \neq 1$.


3.4.4.4 Qualitative analysis 

For a more qualitative insight into the model performance, we did a further direct
comparison of the HPYLM+c model and the MKN baseline by ranking test set
compounds by the difference in probability value that each model assigns to the
n-gram. The test compounds where the compound model does best (Table 3.5, top)
are all words for which an analysis into a context-dependent head and modifiers
should clearly be beneficial. For example, in scoring the phrases wochen vor den
präsidentschaftswahlen (‘weeks before the presidential elections’) and tage vor den
parlamentswahlen (‘days before the parliamentary elections’), the head wahlen is
having a mutually reinforcing effect.

8 This analysis uses cross-entropy rather than perplexity because the dynamic range of
perplexity varies drastically across the test subsets, which impedes interpretation
when presenting results on a single visual scale.

Figure 3.6: Composition of test-set in terms of number of tokens with hit-length h and
compound status. The total of 178525 test tokens breaks down into 6074 compounds and
172451 non-compounds, the latter including the end-of-sentence padding and OOVs, which
are handled with mapping to an Unk symbol.

In contrast, we find that the cases where the MKN baseline model does best
(Table 3.5, bottom) feature various words that are not strictly speaking compounds, and
feature over-segmentation by the supervised compound segmentor, e.g. ging+rich
and wissen+schaften (‘sciences’), or through the greedy splitting on hyphens, e.g.
ki-+moon. These are words where our compound model’s smoothing is hurting
performance, since it allocates some probability mass toward observing other modifiers
with these heads, which in the case of these proper nouns will not happen. This is
evidence of success on the part of our model’s underlying mechanism, but demonstrates
that more care should be taken with the particular segmentation method used.

3.4.4.5 Effect of training data size 

We performed a scaling experiment to investigate the model’s sensitivity to training 
data size. The vocabulary at each training data size is determined directly by the data 
(excluding singletons in order to learn reasonable conditional distributions involving 
HPYLM+c better                                     Δ

gegen die umstrittene wieder+wahl                  0.058
aufbau der afghanischen sicherheits+kräfte         0.036
dessen zentralen gesichts+punkten                  0.035
in annapolis , mary+land                           0.035
wochen vor den präsidentschafts+wahlen             0.032
dieses vertrauen nicht miss+brauchen               0.030
für psychiatrie und psycho+therapie                0.028
tage vor den parlaments+wahlen                     0.028
reduktion der treibhausgas+emissionen              0.025
in einem unblutigen militär+putsch                 0.021

Baseline (MKN) better                              Δ

, newt ging+rich                                   0.511
nächtlichem flug+lärm                              0.449
generalsekretär ban ki-+moon                       0.423
in st. peters+burg                                 0.420
im 17. jahr+hundert                                0.419
saalpublikums in st. peters+burg                   0.359
militanten klerikers moqtada al-+sadr              0.352
un-hochkommissarin für menschen+rechte             0.286
schwebt in lebens+gefahr                           0.231
der akademie der wissen+schaften                   0.212


Table 3.5: Compounds from the monolingual test set for which HPYLM+c outperforms
MKN by the largest margin (top) and vice versa (bottom). We define the margin Δ as the
difference in probability that the models assign to the given test n-gram.


the unknown word symbol). Perplexity is therefore not directly comparable from
one size to another, hence we report relative perplexity differences in Figure 3.7.
The results show that HPYLM+c provides a small but consistent improvement
over the HPYLM baseline as training data size is varied over two orders of magnitude.
The two Bayesian models outperform the MKN baseline by an increasing
relative margin as data size grows, and the additional benefit of HPYLM+c over
HPYLM is additive. These trends are consistent with our intuition that productive
compounding affects language model performance not just at the smallest of data
sizes; extrapolation suggests that a consistent benefit could be derived from
accounting for compounding at larger data sizes as well.
Figure 3.7: Relative perplexity differences of HPYLM+c against the two baseline models as
the number of training tokens is varied from 100k to 60m tokens. 4-gram models are used
throughout.


3.4.5 Extrinsic machine translation evaluation 


Our intrinsic evaluation established that the HPYLM+c obtains small improvements 
over the MKN and HPYLM baseline models. This section reports on a machine 
translation experiment, where we compare the quality of the output produced by the 
translation system when using different language models. 

We measure translation quality with the automated metric Bleu (Papineni et al., 2002).

3.4.5.1 System details 


This evaluation uses the hierarchical phrase-based translation system implemented
by the cdec decoder (Dyer et al., 2010).

The English-German bitext consisted of Europarl and news commentary, filtered
after preprocessing to exclude sentences longer than 50 tokens, resulting in 1.7 million
parallel sentences; word alignments were inferred from this using the Berkeley
Aligner (Liang et al., 2006) and used as a basis from which to extract a synchronous
context-free grammar (SCFG; Chiang, 2007).

The weights of the log-linear translation model were optimised towards the
Bleu metric using cdec’s implementation of Minimum Error-rate Training (MERT;
Och, 2003). The development set news-test2008 (2051 sentences) was used for this
                     PPL      Bleu    std.dev

MKN                  299.9    13.9    ±0.3
HPYLM                294.0    13.9    ±0.3
HPYLM+c (A_ling)     294.1    13.9    ±0.1
HPYLM+c (A_inv)      305.5    13.7    ±0.2


Table 3.6: Machine translation results obtained with different language models. The last
two columns give Bleu mean and standard deviation over three runs. For reference, we
also show the perplexity of each (4-gram) language model on the monolingual test set.


weight tuning. The evaluation result is Bleu measured on the test set newstest2011
(3003 sentences, 171460 tokens). To counter the known instability of MERT+Bleu
(Clark et al., 2011), three repetitions were done and the results averaged.


3.4.5.2 Language model feature function 


The independent variable in this translation experiment is the language model. The 
two baseline systems respectively use the 4-gram MKN and HPYLM baseline lan¬ 
guage models from the preceding monolingual evaluation as a feature function in 
the log-linear translation model. To test the effect of the compounding LM on trans¬ 
lation, this feature function is replaced by the HPYLM+c variant in question. Each 
translation system therefore uses a single language model feature. 


3.4.5.3 Translation results 


In terms of Bleu score, we do not find a meaningful difference between the various
systems (Table 3.6). The system using HPYLM+c matches the two baselines, a result
that indicates our more expressive modelling is not sacrificing any performance
in this task. This is an important outcome, as it means we avoid a common pitfall
whereby a new model is proposed to target some specific phenomenon only to sacrifice
performance globally. Consistent with the intrinsic evaluation, the use of the
left-headed model leads to a small decrease in Bleu.
                     P           R           F

MKN                  25.5±1.5    17.2±1.0    20.5±0.8
HPYLM                24.5±2.4    17.5±0.7    20.4±0.4
HPYLM+c (A_ling)     27.7±2.3    17.4±0.7    21.3±0.4
HPYLM+c (A_inv)      24.0±2.6    17.2±0.4    20.0±0.6


Table 3.7: Precision, Recall and F-score for compounds in the translation output, relative to
the reference translations containing 2652 compounds in total. Standard deviations across
the three MT repetitions are given next to the measurements.


3.4.5.4 Compound-specific evaluation 


Next, we turn to a more fine-grained look at the translation output. The Bleu metric 
is likely to miss small improvements in translation quality. Moreover, in the refer¬ 
ence translations of the test set, only 2652 of the 72661 tokens are compounds; a 
moderate improvement in generating them is unlikely to have a big impact on the 
Bleu. 

To assess the quality of the translation system in terms of compound words, we
use precision, recall and balanced F-score of hypothesised compounds with respect
to compounds in the reference sentences. Specifically, let $R_i$ (resp. $H_i$) be the set
of compound words occurring in the $i$-th reference sentence (resp. hypothesis).
Precision $P$ and recall $R$ are calculated through micro-averaging as

\[ P = \frac{\sum_i |H_i \cap R_i|}{\sum_i |H_i|} \quad \text{and} \quad R = \frac{\sum_i |H_i \cap R_i|}{\sum_i |R_i|}, \]

while the F-score is the harmonic mean of precision and recall,
$F_1 = \frac{2 \times P \times R}{P + R}$. These metrics are computed per individual system
run and then averaged across repetitions to obtain the percentage-based results
shown in Table 3.7.
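
As an illustration only, the micro-averaged compound metrics defined above can be sketched in Python as follows. The function name `compound_prf`, the convention of returning zero for empty denominators, and the toy sentences are assumptions made for this example; they are not taken from the evaluation code used in the experiments.

```python
def compound_prf(hyp_sets, ref_sets):
    """Micro-averaged precision, recall and F1 over per-sentence compound sets.

    hyp_sets[i] and ref_sets[i] are the sets of compounds found in the i-th
    hypothesis and reference sentence, respectively.
    """
    if len(hyp_sets) != len(ref_sets):
        raise ValueError("need one hypothesis set per reference set")
    # Numerator shared by precision and recall: sum_i |H_i ∩ R_i|
    overlap = sum(len(h & r) for h, r in zip(hyp_sets, ref_sets))
    n_hyp = sum(len(h) for h in hyp_sets)   # sum_i |H_i|
    n_ref = sum(len(r) for r in ref_sets)   # sum_i |R_i|
    # Assumption: a zero denominator yields 0.0 rather than an error.
    precision = overlap / n_hyp if n_hyp else 0.0
    recall = overlap / n_ref if n_ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical two-sentence example: one spurious compound in the hypothesis.
hyps = [{"tagesordnung"}, {"prostatakrebs", "selbstmordattentate"}]
refs = [{"tagesordnung"}, {"prostatakrebs"}]
precision, recall, f1 = compound_prf(hyps, refs)
```

Because the counts are pooled over all sentences before dividing, sentences with many compounds carry proportionally more weight than under a per-sentence (macro) average.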


The right-headed HPYLM+c obtains an average improvement in compound pre¬ 
cision of 9%-13% relative to the two baseline systems (2-3 absolute percentage 
points). It does so without a decrease in recall, hence the gain in precision is not 
achieved simply by the system being more conservative about outputting compounds 
in the first place. The left-headed HPYLM+c performs marginally below the base¬ 
line methods in this evaluation as well. 
3.4.5.5 Qualitative analysis of translated compounds


The translation output was further analysed to find specific examples of precision 
improvements attained by the HPYLM+c, compared to the HPYLM baseline. Three 
examples are given in Table 3.8.

In Example 1, the baseline system only partially produced the translation of
the input phrase prostate cancer, while the HPYLM+c system succeeded in producing
the full compound.

Example 2 illustrates an improved lexical choice associated with the HPYLM+c
system: while agenda is a valid translation of the orthographically equivalent English
input word, the more idiomatic translation tagesordnung is produced by the
HPYLM+c system, thereby matching the reference.

Finally, in Example 3 the baseline system incorrectly posits the compound
selbstmordattentate (‘suicide attacks’), while the HPYLM+c system correctly produces
the less specific compound.

In the first two examples, the sentential context plays to the strength of the main 
assumption of the HPYLM+c (i.e. “context generates head”), offering a potential 
explanation for the improved output. However, in the final example, this is not the 
case; the compound’s context in the hypotheses is not particularly informative, and 
the improvement has to be put down to a more complex discriminating role fulfilled 
by the language model during decoding. 

3.5 Related work 




Another common approach for addressing the sparsity effects of productive compounding
(Koehn and Knight, 2003; Koehn et al., 2008; Stymne, 2009; Berton et al., 1996)
and rich morphology (Habash and Sadat, 2006; Geutner, 1995) has been to
use pre/post-processing with an otherwise unmodified translation system or speech
recognition system. This approach views the existing machinery as adequate and
shifts the focus to finding a more appropriate segmentation of words into tokens,
i.e. compounds into parts or words into morphemes, thus achieving a vocabulary
reduction. The downside of such a method is that training a standard n-gram language
model on pre-segmented data introduces unwanted effects: in the case of German
compounds, the split-off modifiers would take precedence in the head’s n-gram
context, and during back-off the actual word-context information is discarded first.


As illustrated earlier in the “forecasters predicted” example in §2.2.3, the problem
is similar when modelling sequences of morphemes as n-grams, and earlier work
in speech recognition has shown that taking steps against this effect can improve
recognition accuracy (Ircing et al., 2001).

Pre-processing also often requires heuristics to guard against over/under-seg¬ 
mentation, which do not generalise well to different settings or languages. The 
modelling approach introduced in this chapter is also subject to the whims of our 
compound segmentation method, but the model is more robust since it retains the 
original surface form of the word—recall that the decomposition step amounts to 
interpolated back-off. 


Baroni and Matiasek (2002) proposed basic models of German compounds for
use in predictive text input, exploiting the same link between right-headedness and
context as we have, although their focus was restricted to compounds with two
components.

Considering language modelling with PYPs more generally, two pieces of work
in particular warrant discussion. Wood and Teh (2009) generalised hierarchical
PYPs beyond tree hierarchies to directed acyclic graph structures. They modelled
domain adaptation by tying together multiple domain-specific HPYLMs and a latent
“background” HPYLM. The intuition for modelling a given n-gram is that a
domain-specific model is able to back off either by shortening the context or by switching
to the background model while retaining the same context. Importantly, this choice
is available at all context lengths and is executed by using a base distribution that
is a mixture of PYPs. In contrast, our HPYLM+c ties together a head-generating
HPYLM (GY) and a modifier-generating HPYLM (TV) only at the level of full contexts,
and does so via a base distribution (of H ) that is a product instead of a mixture.
It could therefore be interesting to apply the graphical LM of Wood and Teh (2009)
to the compounding problem, as it would allow back-off through context-dropping
and back-off through modifier-dropping at different context lengths.


Finally, the PYP has recently also been used to devise language models that can
leverage the output of traditional finite-state morphological analysers. Chahuneau
et al. (2013b) define an open-vocabulary PYP-based word-model that generates
words from morphemes. This word-model is then integrated into an HPYLM by
replacing the uniform base distribution $G_0$, and performs well in their evaluation.
Though related in the sense of also modelling sub-word elements, the HPYLM+c
contrasts with their approach by seeking to model cross-word dependencies.


3.6 Summary 

In this chapter we formulated a Bayesian n-gram language model capable of scoring
compound words in terms of their components, which mitigates the data sparsity
caused by productive compounding. We exploited the linguistic notion that the com¬ 
pound head is the most relevant part for syntactic coherence within a sentence. Our 
results on German indicate that this is a useful property to take into account, as the 
perplexity degraded when setting up the model to ignore the right-headedness of 
German. In an extrinsic evaluation in a machine translation system, the proposed 
language model contributed to outputting compound words with higher precision. 
Chapter 4

Rich morphology in distributed 
language models 


Chapter Abstract 


In this chapter we develop a distributed language model that composes 
morpheme vectors into word vectors so that morphologically related 
words are linked in the model. We show that this leads to substan¬ 
tial improvements in perplexity and that the acquired morpheme vec¬ 
tors successfully enable the construction of representations for unknown 
words. A further contribution is to demonstrate the full integration of a 
distributed language model into a machine translation decoder, which 
leads to improvements in the translation quality metric. Material in this
chapter was originally published as Botha and Blunsom (2014).


In the previous chapter we leveraged existing morphological information in the
form of compound segmentations to improve the smoothing of n-gram distributions.
While this led to some performance improvements, our results highlighted the challenge
in devising suitable back-off schemes that correctly prioritise the different
types of generalisation (e.g. compound decomposition versus context truncation).

This provides part of the motivation for shifting the focus to distributed lan¬ 
guage models (DLMs), which circumvent the need for carefully constructing back¬ 
off schemes by modelling arbitrary correlations between the feature representations 
of context words and their subsequent target word. 
Word-based distributed representations have been found to capture some morphological
regularity automatically (Mikolov et al., 2013b), but we contend that
there is a case for building a priori morphological awareness into the language models’
inductive bias. Conversely, compositional vector-space modelling has recently
been applied to morphology to good effect (Lazaridou et al., 2013; Luong et al.,
2013), but lacked the probabilistic basis necessary for use with a machine translation
decoder.

The method we propose strikes a balance between probabilistic language mod¬ 
elling and morphology-based representation learning. Word vectors are composed 
as a linear function of arbitrary sub-elements of the word, e.g. surface form, stem, 
affixes, or other latent information. The effect is to tie together the representations 
of morphologically related words, directly combating data sparsity. This is executed 
in the context of a log-bilinear (LBL) LM (Mnih and Hinton, 2007), which is sped 
up sufficiently by the use of word classing so that we can integrate the model into an 
open source machine translation decoder and evaluate its impact on translation into 
6 languages, including the morphologically complex Czech, German and Russian. 


4.1 Additive word representations 

As introduced in Chapter 2, a DLM associates with each word type $v$ in the vocabulary
$\mathcal{V}$ a $d$-dimensional feature vector $r_v \in \mathbb{R}^d$. Regularities among words are
captured in an opaque way through the interaction of these feature values and a set 
of transformation weights. This leverages linguistic intuitions only in an extremely 
rudimentary way, in contrast to hand-engineered linguistic features that target very 
specific phenomena, as often used in supervised-learning settings. 

We seek a compromise that retains the unsupervised nature of DLM feature vec¬ 
tors, but also incorporates pre-existing morphological information into the model in 
a flexible and efficient manner. In particular, morphologically related words should 
share statistical strength in spite of differences in word form. 
As a way of integrating additional information about a word type, we define a
mapping $\mu : \mathcal{V} \to \mathcal{F}^{+}$ of a surface word into a variable-length sequence of factors,
i.e. $\mu(v) = (f_1, \ldots, f_{K(v)})$, where $v \in \mathcal{V}$, $f_i \in \mathcal{F}$ and $1 \le i \le K(v)$. A factor $f$,
which could be a word’s stem or suffix, or some other relevant sub-word element,
has an associated factor feature vector $r_f \in \mathbb{R}^d$. Under this view, the vector
representation of a word $v$ is reformulated as a function $\omega_\mu(v)$ of its factor vectors,
adding them together:

\[ \tilde{r}_v = \omega_\mu(v) = \sum_{f \in \mu(v)} r_f \tag{4.1} \]

The vectors of morphologically related words become explicitly linked through
shared factor vectors, for example

    imperfection = im + perfect + ion
    perfectly    = perfect + ly,

where the left-hand side denotes a composed word vector and each term on the
right-hand side denotes a factor vector.

A valuable by-product of this approach is that it gives a well-defined method for 
constructing representations for out-of-vocabulary (OOV) words using the available 
morpheme vectors. 

In practice, we set up the factorisation $\mu$ to also include the surface form of a
word as a factor. This is done to account for non-compositional constructions such
as

    greenhouse ≠ green + house

and also makes the approach more robust to noisy morphological segmentations.
The word representation for the preceding example is therefore

    greenhouse = greenhouse + green + house.

An obvious shortcoming in the choice of addition as composition function $\omega$
is its commutativity. Inclusion of the surface factor also overcomes the associated
invariance to the order of morpheme factors, with the desired outcome that
houseboat ≠ boathouse, for example.
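
The additive composition of Equation 4.1, including the surface form as a factor, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `compose`, the two-dimensional toy vectors and the choice to skip factors without a vector (so that an unseen word is built from its known morphemes) are all hypothetical and are not taken from our implementation.

```python
def compose(factors, factor_vectors):
    """Sum the vectors of the given factors (additive composition, Eq. 4.1).

    Factors without an entry in `factor_vectors` are skipped, which is one
    possible way to build a vector for an out-of-vocabulary word from its
    known morphemes (an assumption of this sketch).
    """
    known = [factor_vectors[f] for f in factors if f in factor_vectors]
    if not known:
        raise KeyError("none of the factors has a vector")
    # Element-wise sum across the selected factor vectors.
    return [sum(column) for column in zip(*known)]


# Hypothetical 2-dimensional factor vectors.
VECTORS = {
    "house": [1.0, 0.0],
    "boat": [0.0, 1.0],
    "s": [0.25, 0.25],
    "houseboat": [0.5, 0.0],   # surface-form factors
    "boathouse": [0.0, 0.5],
}

# Morphemes alone cannot distinguish the two words: addition is commutative.
same = compose(["house", "boat"], VECTORS) == compose(["boat", "house"], VECTORS)

# Including the surface factor breaks the tie.
houseboat = compose(["houseboat", "house", "boat"], VECTORS)
boathouse = compose(["boathouse", "boat", "house"], VECTORS)

# An unseen plural has no surface vector but still receives a representation.
boats = compose(["boats", "boat", "s"], VECTORS)
```

In this toy setting the two compounds share their morpheme contributions and differ only through their surface vectors, which is the behaviour the surface factor is meant to provide.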
The number of factors per word is free to vary across the vocabulary, making the
approach applicable across the spectrum of more fusional languages (e.g. Czech,
Russian) to more agglutinative languages (e.g. Turkish). This is in contrast to factored
language models (Alexandrescu and Kirchhoff, 2006), which assume a fixed
number of factors per word. Their method of concatenating factor vectors to obtain
a single representation vector for a word can be seen as enforcing a partition on the
feature space. Our method of addition avoids such a partitioning and better reflects
the absence of a strong intuition about what an appropriate partitioning might be.
A limitation of our method compared to theirs is that the deterministic mapping $\mu$
currently enforces a single factorisation per word type, which sacrifices information
obtainable from context-disambiguated morphological analyses.

The additive composition function $\omega$ can be regarded as an instantiation of the
weighted addition strategy wadd that performed well in the distributional semantics
approach to derivational morphology reported by Lazaridou et al. (2013). Their
weighted addition involves using a global pair of weights $\alpha$ and $\beta$ to compose a
word as $\alpha \times$ stem $+\ \beta \times$ suffix. (Their focus is limited to the segmentation of
a word into a stem and suffix, and evaluated only on English.) Importantly, they
fit the two weights while keeping the vector representations fixed. As described
shortly, we situate our composition method within a distributed language model. Our
feature representations can therefore adjust jointly with the effect of the composition.
To a first approximation, individual feature values can thereby absorb a weighting
component specific to a morpheme type or a feature dimension, rather than using a
single pair of global weights.

Another aspect of our composition method is how it models derivational ambiguity.
Unlike the recursive neural-network method of Luong et al. (2013), we do not
impose a single tree structure over a word, as doing so ignores the semantic ambiguity
inherent in words like un[[lock]able] vs. [un[lock]]able.¹ The consideration is that
a model should either account for such ambiguity directly, or operate in a way that
treats alternative derivations uniformly so that it makes minimal assumptions about
the language in question. Our approach is to do the latter.

1 Example due to de Almeida and Libben (2005).

In contrast to these two recent approaches to vector-based morphological modelling,
our additive representations are readily implementable in a probabilistic language
model suitable for use in a decoder.²


4.2 Log-bilinear language models 


In this section we formalise the distributed language model into which we subse¬ 
quently integrate our compositional approach to morphology. 

Log-bilinear (LBL) models (Mnih and Hinton, 2007) are an instance of DLMs
that make the same Markov assumption as n-gram language models, as introduced
in Chapter 2. The conditional distribution $P(w_i \mid w_{i-n+1}^{i-1})$ of a word given its context
is modelled by a smooth scoring function $\nu(\cdot)$ over vector representations of words.
The LBL predicts the vector $p$ for the next word as a function of the context
vectors $q_j \in \mathbb{R}^d$ of the preceding $n-1$ words,

\[ p = \eta\Big( \sum_{j=1}^{n-1} q_j C_j \Big), \tag{4.2} \]

where $C_j \in \mathbb{R}^{d \times d}$ are position-specific weight matrices, and $\eta$ is the identity
function $\eta(x) = x$, written this way for a subsequent variation.

The continuous scoring function $\nu(w)$ measures how well a word $w$ fits the predicted
representation $p$ and is defined as

\[ \nu(w) = p \cdot r_w + b_w, \tag{4.3} \]

where $r_w \in \mathbb{R}^d$ is the representation vector of the target word $w$ and $b_w$ is a bias
term encoding the prior probability of the word type. The conditional probability
of the word is calculated through exponentiation and normalisation of the scoring
function:

\[ P(w_i \mid w_{i-n+1}^{i-1}) = \frac{\exp(\nu(w_i))}{\sum_{v \in \mathcal{V}} \exp(\nu(v))}. \tag{4.4} \]


2 Our focus is on surface morphology, but the method could equally be used to integrate other 
information, e.g. PoS, lemma. 
This model is subsequently denoted as LBL with parameters $\Theta_{\text{LBL}} = (C_j, Q, R, b)$,
where $Q, R \in \mathbb{R}^{|\mathcal{V}| \times d}$ contain the word representation vectors as rows, and $b \in \mathbb{R}^{|\mathcal{V}|}$.
$Q$ and $R$ imply that separate representations are used for conditioning and output.
See Figure 4.1a for an illustrative diagram.
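
To make Equations 4.2 to 4.4 concrete, the following is a minimal pure-Python sketch of the LBL forward computation. The function names, the list-of-lists data layout and the pairing of context position `j` with matrix `C[j]` are assumptions made for this illustration; the sketch does not reproduce our actual implementation and ignores all efficiency concerns.

```python
import math


def row_times_matrix(row, matrix):
    """Multiply the row vector `row` (length d) by `matrix` (d x d)."""
    return [sum(row[k] * matrix[k][m] for k in range(len(row)))
            for m in range(len(matrix[0]))]


def lbl_distribution(context, Q, R, b, C):
    """Return P(w | context) for every word id w (Equations 4.2 to 4.4).

    context -- ids of the n-1 preceding words; position j is paired with C[j]
               (the ordering convention is an assumption of this sketch)
    Q, R    -- context and target word vectors, one row per word id
    b       -- per-word bias terms
    C       -- list of n-1 position-specific d x d weight matrices
    """
    d = len(R[0])
    p = [0.0] * d
    for j, word_id in enumerate(context):
        # Equation 4.2 with eta as the identity: p = sum_j q_j C_j
        contribution = row_times_matrix(Q[word_id], C[j])
        p = [a + c for a, c in zip(p, contribution)]
    # Equation 4.3: score of each candidate target word
    scores = [sum(pk * rk for pk, rk in zip(p, R[v])) + b[v]
              for v in range(len(R))]
    # Equation 4.4: softmax, shifted by the maximum for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy model: 3 word types, d = 2, 3-gram (two context positions).
identity = [[1.0, 0.0], [0.0, 1.0]]
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
R = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0]
probs = lbl_distribution([0, 0], Q, R, b, [identity, identity])
```

The loop over the whole vocabulary when computing `scores` is the normalisation cost that makes direct use of such a model in a decoder expensive.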


4.3 Variants of the log-bilinear model 

4.3.1 Additive log-bilinear model 


We introduce a variant of the LBL that makes use of additive representations (§4.1)
by associating the composed word vectors $\tilde{r}$ and $\tilde{q}_j$ with the target and context words,
respectively. We redefine the representation matrices to be $Q^{(f)}, R^{(f)} \in \mathbb{R}^{|\mathcal{F}| \times d}$, thus
containing a vector for each factor type in the global vocabulary of factors $\mathcal{F}$. This
model is designated LBL++ and has parameters $\Theta_{\text{LBL++}} = (C_j, Q^{(f)}, R^{(f)}, b)$. See
the example in Figure 4.1c.


Words sharing factors are tied together, which is expected to improve performance
on rare word forms. Note that the composed word vectors are not model
parameters in their own right, but are strictly a function of the factor vectors according
to the composition function $\omega$ and mapping $\mu$. The scoring function for this
model is an expansion of Equations 4.2 and 4.3 to include the additive representations:

\begin{align*}
\nu_{\text{LBL++}}(w) &= \eta\Big( \sum_{j=1}^{n-1} \tilde{q}_j C_j \Big) \cdot \tilde{r}_w + b_w \tag{4.5} \\
&= \eta\Big( \sum_{j=1}^{n-1} \sum_{f \in \mu(w_j)} q_f C_j \Big) \cdot \sum_{f \in \mu(w)} r_f + b_w \tag{4.6}
\end{align*}

Similar to Equation 4.4, the probability model is

\[ P(w_i \mid w_{i-n+1}^{i-1}) = \frac{\exp(\nu_{\text{LBL++}}(w_i))}{\sum_{v \in \mathcal{V}} \exp(\nu_{\text{LBL++}}(v))}. \tag{4.7} \]


Representing the mapping $\mu$ with a sparse transformation matrix $M \in \mathbb{Z}_{+}^{|\mathcal{V}| \times |\mathcal{F}|}$
establishes the relationship between composed word and factor representation matrices
as $R = M R^{(f)}$ and $Q = M Q^{(f)}$. A row in $M$ selects the factor vectors for
[Figure 4.1, panels: (a) LBL, (b) CLBL, (c) LBL++, (d) CLBL++]

Figure 4.1: Illustration of how three different 3-gram model variants treat the Czech phrase
pro novou školu (‘for [the] new school’), where the preposition causes accusative case
markings on the adjective and noun. In (b) and (d), we assume the vocabulary clustering
(§4.3.2) assigned class 17 to the target word.


a given word type from the $R^{(f)}$ (or $Q^{(f)}$) matrix. These row-vectors are a generalisation
of one-hot vectors: they have multiple non-zero entries, corresponding to
the multiple factors that make up a word, and integer values greater than one would
encode multiple occurrences of the same factor within a given word. In practice,
we exploit this correspondence between composed word representations and factor
representations for test-time efficiency: the composed word vectors are compiled
offline so that the computational cost of LBL++ probability lookups is equivalent to
that of LBL.
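
The offline compilation step described above amounts to one sparse matrix product. The sketch below illustrates it under stated assumptions: each row of $M$ is stored as a dictionary from factor index to count, and the name `compile_word_matrix` and the toy values are hypothetical rather than taken from our code.

```python
def compile_word_matrix(m_rows, factor_matrix):
    """Compute the composed word matrix R = M R^(f) from sparse rows of M.

    m_rows        -- one dict per word type, mapping factor index -> count
                     (a sparse row of M; counts above one encode a factor that
                     occurs more than once in the word)
    factor_matrix -- factor vectors, one row per factor type
    """
    d = len(factor_matrix[0])
    word_matrix = []
    for row in m_rows:
        vec = [0.0] * d
        for factor_index, count in row.items():
            for k in range(d):
                vec[k] += count * factor_matrix[factor_index][k]
        word_matrix.append(vec)
    return word_matrix


# Hypothetical example: three factor types, two word types.
R_f = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
M = [
    {0: 1, 1: 1},   # word 0 consists of factors 0 and 1
    {2: 2},         # word 1 contains factor 2 twice
]
R_words = compile_word_matrix(M, R_f)
```

Once `R_words` has been computed, probability lookups can treat it exactly like the word matrix of a plain LBL model, which is why the test-time cost is unchanged.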

We consider two obvious variations of the LBL++ to evaluate the extent to which
interactions between context and target factors affect the model: LBL+o only factorises
output words and retains simple word vectors for the context (i.e. $Q = Q^{(f)}$),
while LBL+c does the reverse, only factorising context words.³ Both reduce to the
LBL when setting $\mu$ to be the identity function, such that $\mathcal{V} = \mathcal{F}$.

The factorisation permits an approach to handling unknown context words that 
is less harsh than the standard method of replacing them with a global unknown 
symbol—instead, a vector can be constructed from the known factors of the word 
(e.g. the observed stem of an unobserved inflected form). A similar scheme can be 
used for scoring unknown target words, but requires changing the event space of the 
probabilistic model. We use this vocabulary stretching capability in our word simi¬ 
larity experiments, but leave the extensions for test-time language model predictions 
as future work. 

A consequence of the additional free parameters introduced by the additive
representations is that a maximum-likelihood training criterion would favour the
degenerate solution where the non-surface vectors in $\mathcal{F} \setminus \mathcal{V}$ are pushed to zero, rendering
the additive extension ineffective. Including a regularisation term in the training
objective is crucial and should yield non-trivial values for the parameters tied through
our additive representations. (Figure 4.2 in the experimental section demonstrates
the sensitivity of performance to the regularisation strength.)


4.3.2 Class-based model decomposition 

DLMs are known to achieve much lower perplexities than conventional n-gram
models, but their application to decoder-based tasks like translation and speech
recognition has largely been limited to rescoring of lattices or n-best lists of
hypotheses. However, they could exercise a greater influence in those tasks if
integrated into the decoders.

The key obstacle to using DLMs in a decoder is the expensive normalisation over
the vocabulary (Equation 4.4). Our approach to reducing the computational cost
of normalisation is to use a class-based decomposition of the probabilistic model
(Goodman, 2001; Mikolov et al., 2011). Using Brown clustering (Brown et al.,


3 The +c, +o and ++ naming suffixes denote these same distinctions when used with the CLBL
model introduced later. 
1992),⁴ we partition the vocabulary into $|\mathcal{C}|$ classes, denoting as $\mathcal{C}_c$ the set of
vocabulary items in class $c$, such that $\mathcal{V} = \mathcal{C}_1 \cup \cdots \cup \mathcal{C}_{|\mathcal{C}|}$. This is a hard clustering, which
we treat as fixed and observed for a given vocabulary and dataset. The class $c(w)$ of
a word $w \in \mathcal{V}$ can therefore be looked up. We emphasise that we are working in a
discriminative setting.

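In this model, the probability of a word conditioned on the history $h$ of $n-1$
preceding words⁵ is decomposed as

\begin{align*}
P(w \mid h) &= \sum_{c' \in \mathcal{C}} P(c' \mid h)\, P(w \mid h, c') \tag{4.8} \\
&= P(c(w) \mid h)\, P(w \mid h, c(w)) + \sum_{c' \in \mathcal{C},\, c' \neq c(w)} \big( P(c' \mid h) \cdot 0 \big) \tag{4.9} \\
&= P(c(w) \mid h)\, P(w \mid h, c(w)). \tag{4.10}
\end{align*}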


This class-based model, CLBL, extends the LBL by associating a representation vector
$s_c$ and bias parameter $t_c$ with each class $c$, such that $\Theta_{\text{CLBL}} = (C_j, Q, R, S, b, t)$.
The same prediction vector $p$ is used to compute both class score $\tau(c) = p \cdot s_c + t_c$
and word score $\nu(w)$, which are normalised separately:

\begin{align*}
P(c \mid h) &= \frac{\exp(\tau(c))}{\sum_{c'=1}^{|\mathcal{C}|} \exp(\tau(c'))} \tag{4.11} \\
P(w \mid h, c) &= \frac{\exp(\nu(w))}{\sum_{v' \in \mathcal{C}_c} \exp(\nu(v'))}. \tag{4.12}
\end{align*}


Figure 4.1b provides an illustration. 

We favour this flat vocabulary partitioning for its computational adequacy, simplicity
and robustness. Computational adequacy is obtained by using $|\mathcal{C}| \approx |\mathcal{V}|^{0.5}$,
thereby reducing the $O(|\mathcal{V}|)$ normalisation operation of the LBL to two $O(|\mathcal{V}|^{0.5})$
operations in the CLBL, as Brown clustering mostly yields balanced class sizes.
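
The two-step normalisation of Equations 4.10 to 4.12 can be sketched as below. This is an illustrative sketch only: the helper names, the toy parameters and the representation of the clustering as `word_class` and `class_members` lookups are assumptions, and the prediction vector `p` is taken as given rather than computed from a context.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Normalise scores into probabilities, shifted by the max for stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clbl_prob(word, p, R, b, S, t, word_class, class_members):
    """P(w | h) = P(c(w) | h) * P(w | h, c(w)), given the prediction vector p."""
    c = word_class[word]
    # Equation 4.11: normalise over the classes only.
    class_probs = softmax([dot(p, S[k]) + t[k] for k in range(len(S))])
    # Equation 4.12: normalise over the words in the class of `word` only.
    members = class_members[c]
    word_probs = softmax([dot(p, R[v]) + b[v] for v in members])
    return class_probs[c] * word_probs[members.index(word)]


# Hypothetical toy setup: 4 word types in 2 classes of 2 words each.
p = [1.0, 0.0]
R = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0, 0.0]
S = [[1.0, 0.0], [0.0, 0.0]]
t = [0.0, 0.0]
word_class = [0, 0, 1, 1]
class_members = [[0, 1], [2, 3]]
total = sum(clbl_prob(w, p, R, b, S, t, word_class, class_members)
            for w in range(4))
```

Each query touches the class vectors plus the words of a single class, rather than the full vocabulary; the probabilities still sum to one over all words because every word belongs to exactly one class.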


4 In preliminary experiments, Brown clusters gave much better perplexities than
frequency-binning (Mikolov et al., 2011b).

5 This is therefore not a class-based language model in the sense of predicting the next class
conditioned on the preceding classes (Brown et al., 1992).
Other methods for achieving more drastic complexity reductions exist in the
form of frequency-based truncation, shortlists (Schwenk, 2004), or casting the vocabulary
as a full hierarchy (Mnih and Hinton, 2008) or partial hierarchy (Le et al.,
2011). We expect these approaches could have adverse effects in the rich morphology
setting, where much of the vocabulary is in the long tail of the word distribution.

4.3.3 Non-linearity 

The use of non-linear activation functions in DLMs is known to improve performance
over strictly linear ones (Mnih et al., 2009; Pachitariu and Sahani, 2013), but
comes at a cost. Typical sigmoidal and tanh non-linearities shrink the gradients,
which causes slower convergence during training than the linear models (~20 vs. ~4
iterations in our experience with CLBL, §4.3.2). Non-linearities can create difficulties
during training (Pascanu et al., 2012) and incur a computational cost at test-time
that one would wish to avoid when using language models in a decoder. Rectified
linear units (Nair and Hinton, 2010) have been used as a workaround (Vaswani
et al., 2013), but a pertinent question is whether a simple technique like our additive
representations might serve as a substitute for non-linearities altogether.⁶

For comparison, we define a non-linear variant nLBL that applies a sigmoidal activation function $\sigma$ to the prediction vector (cf. Equation 4.2), which operates element-wise as $\sigma(x) = (1 + e^{-x})^{-1}$. Our definition here is simpler than the various options explored by Mnih et al. (2009) in that it only applies to the final prediction vector.
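A short worked bound, added as a reading aid rather than as part of the original argument, indicates why the logistic sigmoid shrinks gradients: its derivative never exceeds 1/4, so every gradient component that passes through it is scaled down by at least a factor of four.

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4},
\quad \text{with equality only at } x = 0 .
```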


4.4 Experiments 

The overarching aim of our evaluation is to investigate the effect of using the proposed additive representations across languages with a range of morphological complexity.

6 This consideration is limited to single layer networks—in multi-layer networks, non-linearities 
take on a defining role. 
Our intrinsic language model evaluation has two parts. We first perform a model selection experiment on small data to consider the relative merits of using additive representations for context words, target words, or both, and to validate the use of the class-based decomposition.

Then we consider class-based additive models trained on tens of millions of 
tokens and large vocabularies. These larger language models are applied in two 
extrinsic tasks: i) a word-similarity rating experiment on multiple languages, aiming 
to gauge the quality of the induced word and morpheme representation vectors; ii) a 
machine translation experiment, where we are specifically interested in testing the 
impact of an LBL LM feature when translating into morphologically rich languages. 

4.4.1 Training and initialisation 

Model parameters $\Theta$ are estimated by optimising an L2-regularised log likelihood objective. Training the CLBL and its additive variants directly against this objective is fast because normalisation of model scores, which is required in computing gradients, is over a small number of events, namely classes, and word types in a given class.

For the classless LBLs we use noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2012; Mnih and Teh, 2012) to avoid normalisation during training. NCE achieves this by casting the problem as one of discriminating between real training instances and synthetically generated “noise” instances. For each unigram in the data, we generated 100 noise instances from the empirical unigram distribution. While NCE speeds up training, it leaves the expensive test-time normalisation of LBLs unchanged, precluding their usage during decoding.

Bias terms $\mathbf{b}$ (resp. $\mathbf{t}$) are initialised to the log unigram probabilities of words (resp. classes) in the training corpus, with Laplace smoothing, while all other parameters are initialised randomly according to sharp, zero-mean Gaussians. Representations are thus learnt from scratch and not based on publicly available embeddings, meaning our approach can easily be applied to many languages.
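The bias initialisation described above can be illustrated as follows. This is only a sketch under stated assumptions: Laplace smoothing is taken to mean add-one smoothing over the model vocabulary, and the function name, the `alpha` parameter and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def init_biases(tokens, vocab, alpha=1.0):
    """Log unigram probabilities with add-alpha (Laplace for alpha=1) smoothing.

    tokens  training corpus as a list of word (or class) symbols
    vocab   the symbols that need a bias term
    """
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: math.log((counts[w] + alpha) / total) for w in vocab}

corpus = "the cat sat on the mat".split()
vocab = ["the", "cat", "sat", "on", "mat", "dog"]  # "dog" is unseen in the corpus
biases = init_biases(corpus, vocab)

# Exponentiating the biases recovers a proper distribution over the vocabulary.
mass = sum(math.exp(v) for v in biases.values())
```

Smoothing guarantees that symbols with zero training count, such as "dog" here, still receive a finite bias.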
        Data-1m             Data-Main
        Tokens    |V|       Tokens    |V|       Sentence pairs
Cs      1m        46k       16.8m     206k      0.7m
De      1m        36k       50.9m     339k      1.9m
En      1m        17k       19.5m     60k       0.7m
Es      1m        27k       56.2m     152k      2.0m
Fr      1m        25k       57.4m     137k      2.0m
Ru      1m        62k       25.1m     497k      1.5m

Table 4.1: Corpus statistics. The number of sentence pairs for a row X refers to the English→X parallel data (but row En has Czech as source language).


Optimisation is performed by stochastic gradient descent with updates after each mini-batch of $L$ training examples. We apply AdaGrad (Duchi et al., 2011), which effectively adjusts the learning rate individually for each model parameter as a function of a global step-size hyperparameter $\xi$, which we tune on held-out development data. The number of training iterations $I$ is also treated as a hyperparameter, and training is halted once the perplexity on the development data starts to increase.⁷
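The per-parameter scaling performed by AdaGrad, together with the stopping rule, can be sketched as below. This is a generic illustration on a toy quadratic loss and not the training code used here; the function names and the `eps` constant are assumptions, and gradients are taken with respect to a loss to be minimised (for instance, the negative regularised log likelihood).

```python
import math

def adagrad_step(params, grads, hist, step_size, eps=1e-8):
    """One in-place AdaGrad update.

    hist accumulates squared gradients per parameter, so parameters with a
    history of large gradients receive smaller effective learning rates.
    """
    for i, g in enumerate(grads):
        hist[i] += g * g
        params[i] -= step_size * g / (math.sqrt(hist[i]) + eps)

def should_stop(dev_perplexities):
    """Halt once the development perplexity starts to increase."""
    return len(dev_perplexities) >= 2 and dev_perplexities[-1] > dev_perplexities[-2]

# Toy problem: minimise (x - 3)^2, whose gradient is 2 * (x - 3).
x = [0.0]
hist = [0.0]
for _ in range(500):
    grad = [2.0 * (x[0] - 3.0)]
    adagrad_step(x, grad, hist, step_size=0.5)
```

In this sketch `step_size` plays the role of the global hyperparameter $\xi$, and one call to `adagrad_step` would correspond to one mini-batch update.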


4.4.2 Data and methods 

We make use of data from the 2013 ACL Workshop on Machine Translation.⁸ We first describe the data used for translation experiments, since the monolingual datasets used for language model training were derived from that. The language pairs are English→{German, French, Spanish, Russian} and English↔Czech. Our parallel data comprised the Europarl-v7 and news-commentary corpora, except for English-Russian where we used news-commentary and the Yandex parallel corpus.⁹ Preprocessing involved lowercasing, tokenising and filtering to exclude sentences of more than 80 tokens or substantially different lengths.

4-gram language models were trained on the target data in two batches: Data-1m consists of the first million tokens only, while Data-Main uses the full target-side of the corpora listed above. Statistics are given in Table 4.1. newstest2011 was used as development data¹⁰ for tuning language model hyperparameters, while intrinsic LM evaluation was done on newstest2012, measuring perplexity. In addition to contrasting the LBL variants, we also compare against the MKN baseline.

7 $L$=10k-40k, $\xi$=0.05-0.08, dependent on $|V|$ and data size; $I$=2-4 passes through the data for the linear models.

8 http://www.statmt.org/wmt13/translation-task.html

9 https://translate.yandex.ru/corpus?lang=en


4.4.2.1 Language model vocabulary 


Additive representations that link morphologically related words specifically aim to improve modelling of the long tail of the lexicon, so we do not want to prune away all rare words, as is common practice in language modelling and word embedding learning. We define a singleton pruning rate $\kappa$, and randomly replace a fraction $\kappa$ of words occurring only once in the training data with a global Unk symbol. We use low pruning rates¹¹ and thus model large vocabularies.¹²
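The pruning scheme can be sketched as follows. The details are assumptions made for illustration: the fraction is applied to singleton types by sampling exactly `round(kappa * n)` of them, and the function name, the `<unk>` spelling and the fixed seed are hypothetical.

```python
import random
from collections import Counter

def prune_singletons(tokens, kappa, unk="<unk>", seed=0):
    """Replace a random fraction kappa of singleton words with an Unk symbol."""
    counts = Counter(tokens)
    # Sort so that the sample is reproducible for a given seed.
    singletons = sorted(w for w, n in counts.items() if n == 1)
    rng = random.Random(seed)
    dropped = set(rng.sample(singletons, round(kappa * len(singletons))))
    return [unk if w in dropped else w for w in tokens]

tokens = "a a b c d e f".split()          # five singletons: b c d e f
pruned_some = prune_singletons(tokens, kappa=0.4)
pruned_all = prune_singletons(tokens, kappa=1.0)
```

With $\kappa = 1$ every singleton is mapped to Unk, which corresponds to the conventional practice of discarding all words seen only once; small values of $\kappa$ keep most of the long tail in the vocabulary.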

4.4.2.2 Word factorisation $\mu$


We obtain morphological segmentations from the unsupervised segmentor Morfessor Cat-MAP (Creutz and Lagus, 2007), which takes as input a list of word forms and their frequencies in a corpus. The output consists of the segmentation points along with labels that identify each segment as a prefix, stem or suffix. For example, the word unsegmented is analysed as un_PRE segment_STM ed_SUF.

The mapping $\mu$ of a word is taken as its surface form and the morphemes identified by Morfessor. Keeping the morpheme labels allows the model to learn separate vectors for, say, in_STM the preposition and in_PRE occurring as “in-appropriate”. By not post-processing segmentations in a more sophisticated way (cf. Luong et al. (2013)) we keep the overall method language independent. Table 4.2 provides an example from the word map derived for our Czech data set.
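A small sketch may make the word map concrete. It assumes the segmentor output is available as (morpheme, label) pairs and that a word's additive representation is the sum of its factor vectors; the function names, the `morpheme_LABEL` key format and the toy vectors are hypothetical.

```python
def word_map(surface, segments):
    """Factors of a word: its surface form plus its labelled morphemes."""
    return [surface] + [f"{morph}_{label}" for morph, label in segments]

def additive_vector(factors, table):
    """Sum the vectors of all factors (assumed form of the additive representation)."""
    dim = len(next(iter(table.values())))
    return [sum(table[f][i] for f in factors) for i in range(dim)]

# The example analysis given in the text: un_PRE segment_STM ed_SUF.
factors = word_map("unsegmented", [("un", "PRE"), ("segment", "STM"), ("ed", "SUF")])

# Toy two-dimensional factor vectors, made up for illustration.
table = {
    "unsegmented": [1.0, 0.0],
    "un_PRE": [0.0, 1.0],
    "segment_STM": [0.5, 0.5],
    "ed_SUF": [-0.5, 0.25],
}
vec = additive_vector(factors, table)
```

Because the label is part of the key, in_STM and in_PRE would index separate vectors, which is the distinction the labels are kept for.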

The choice of an unsupervised method of segmentation in this chapter stands in contrast to the supervised compound segmentor used in chapter 3. The evaluation


10 The newstest2011 data set does not feature Russian, so for that language some training data was held out for tuning.

11 Data-1m: $\kappa$ = 0.2; Data-Main: $\kappa$ = 0.05

12 We also mapped digits to 0 (i.e. 15.1% => 00.0%), and cleaned the Russian data by replacing tokens having <80% Cyrillic characters with Unk.
Word v          Factors
aktivovanými    aktivovanými  aktiv_STM  ova_SUF  ný_SUF  mi_SUF
čerpány         čerpány  čerpá_STM  ny_SUF
dorozumívat     dorozumívat  do_PRE  rozum_STM  í_SUF  vat_STM
modlitba        modlitba  modlitba_STM
nejvážnější     nejvážnější  nej_PRE  váž_STM  nější_SUF
nepodkopal      nepodkopal  ne_PRE  podkopal_SUF
škola           škola  škol_STM  a_SUF
školou          školou  škol_STM  ou_SUF
školu           školu  škol_STM  u_SUF
školy           školy  škol_STM  y_SUF

Table 4.2: A fragment of the word map $\mu(v)$ obtained for Czech.


        Data-1m    Data-Main
Cs      2.27       1.92
De      2.81       2.98
En      1.72       1.50
Es      2.03       1.71
Fr      2.26       1.78
Ru      2.32       1.84

Table 4.3: Average number of morphemes per word type obtained with Morfessor for each language (see §4.4.2.2). These numbers reflect the segmentation only and do not include the surface factors.


setup there was simple given the focus on one language, so that there was no downside in relying on a supervised segmentor. In this chapter, we aim to have a broader evaluation on multiple languages. The use of an unsupervised tool such as Morfessor simplifies the empirical work by not requiring special models or calibration for each new language, even though this approach may affect segmentation quality adversely. The LBL++ models are specifically devised to be robust against noisy segmentation. In the experiments that follow, we set the single perplexity threshold parameter of Morfessor to 50 and 400 for Data-1m and Data-Main, respectively, based on prior experience. Segmentation quality was gauged informally by manual inspection of the output. Table 4.3 provides statistics that summarise the degree of segmentation obtained in each language.


[Figure 4.2 plot omitted; legend: MKN, LBL, CLBL, LBL++, CLBL++.]

Figure 4.2: Relative performance of models is sensitive to regularisation strength $\lambda$ when using 1m training tokens. Perplexities are measured on the devset newstest2011.

4.4.3 Model selection 

Results on Data-1m. The LBL training objectives include a weight penalty term $0.5\lambda\lVert\Theta\rVert^2$, whose strength $\lambda$ we tune against the newstest2011 dataset. For the 1m-word corpus size, we find tuning $\lambda$ to be vital for good model performance (Figure 4.2).

The use of morphology-based, additive representations for both context and output words (models++) yielded perplexity reductions on all 6 languages when using 1m training tokens (Table 4.4). Furthermore, these double-additive models consistently outperform the ones that factorise only context (+c) or only output (+o) words, indicating that context and output contribute complementary information and supporting our hypothesis that it is beneficial to model morphological dependencies across words. The results are given by language in Table 4.4 and summarised visually in Figure 4.3.

The impact of our CLBL++ method varies by language, correlating with vocabulary size: Russian benefited most, followed by Czech and German. Even on English, often regarded as having simple morphology, the relative improvement is 4%.

The relative merits of the +c and +o schemes depend on which model is used as starting point. With LBL, the output-additive scheme (LBL+o) gives larger improvements than the context-additive scheme (LBL+c) (Figure 4.3). The reverse is true for CLBL, indicating the class decomposition dampens the effectiveness of using morphological information in output words.

       MKN    CLBL   +c       +o       ++        ++ (PPL)   nCLBL   +c       +o       ++       ++ (PPL)
Cs     550    508    -4.0%    -3.4%    -7.4%     470        487     -3.0%    0.7%     -4.6%    465
De     366    316    -3.5%    -1.0%    -4.3%     303        309     -2.8%    -0.6%    -4.2%    296
En     330    303    -1.6%    -1.3%    -3.8%     291        301     -2.6%    0.1%     -3.0%    292
Es     241    212    -3.6%    -2.3%    -4.9%     202        209     -2.1%    -0.2%    -4.4%    200
Fr     274    239    -2.5%    -1.5%    -5.5%     226        230     -1.6%    -0.7%    -2.4%    225
Ru     396    343    -8.0%    -3.8%    -11.4%    304        353     -5.4%    -0.6%    -8.3%    324
avg    360    320    -3.9%    -2.2%    -6.2%     299        315     -2.9%    -0.2%    -4.5%    300

Table 4.4: Test-set perplexities for Data-1m, measuring the separate effects of context-only (+c), output-only (+o) and double (++) additive representations when applied to the linear and non-linear class-based LBL. Additive model results are also given in terms of percentage reduction in perplexity relative to the corresponding non-additive model; the columns marked (PPL) give the absolute perplexity of the ++ model.
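The percentage columns of Table 4.4 follow from the absolute perplexities by simple arithmetic, sketched below. The function name is hypothetical, and because the table reports rounded perplexities, recomputing from them only reproduces the printed percentages to within rounding.

```python
def relative_change(baseline_ppl, model_ppl):
    """Percentage change in perplexity relative to a baseline (negative = better)."""
    return 100.0 * (model_ppl - baseline_ppl) / baseline_ppl

# Czech, linear model: CLBL = 508 and CLBL++ = 470 in Table 4.4.
czech_linear = relative_change(508, 470)

# Russian, linear model: CLBL = 343 and CLBL++ = 304 in Table 4.4.
russian_linear = relative_change(343, 304)
```

These evaluate to roughly -7.5% and -11.4%, against the printed -7.4% and -11.4%; the small discrepancy for Czech is consistent with the percentages having been computed from unrounded perplexities.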

The non-linear class-based model nCLBL obtains improvements up to 4% over 
its linear counterpart, except for Russian where it performs worse. Significantly, we 
observe that our purely linear CLBL++ outperforms the non-linear nCLBL model, 
establishing an alternative to the non-linear model. Using additive representations 
with the non-linear model further decreases perplexity slightly, though the relative 
improvements obtained when using them for output words only are very small. 

The use of classes increases perplexity slightly compared to the LBLs (see Table 4.5), but this is in exchange for much faster computation of language model probabilities, allowing the CLBLs to be used in a machine translation decoder (§4.4.7). On average, the CLBLs obtain a 35-fold speed-up over the LBLs. The running time of a model was calculated by measuring the total wall-time it took to compute the test-set perplexity, and then averaging across languages. For example, the LBL computed the test-set perplexity in 209 seconds, while the CLBL did so in 6 seconds. The test sets contain 72650 tokens on average.
Figure 4.3: Model selection results. Box-plots show the spread, across 6 languages, of relative perplexity reductions obtained by each type of additive model against its non-additive baseline, for which median absolute perplexity is given in parentheses; for MKN, this is 348. Each box-plot summarises the behaviour of a model across languages. Circles give sample means, while crosses show outliers beyond 3× the inter-quartile range.



MKN       360
LBL       317      CLBL       320
LBL+c     309      CLBL+c     307
LBL+o     299      CLBL+o     313
LBL++     290      CLBL++     299

Table 4.5: The role of classes. Test-set perplexity averaged across languages, using Data-1m for training.

4.4.4 Intrinsic evaluation 

Results on Data-Main. Based on the outcomes of the small-scale evaluation, we focus our main language model evaluation on the additive class-based model CLBL++ in comparison to CLBL and MKN baselines, using the larger training dataset, with vocabularies of up to 500k word types.

The overall trend that morphology-based additive representations yield lower perplexity carries over to this larger data setting, again with the biggest impact being on Czech and Russian (Table 4.6, top). Perplexity improvements are in the 2%-6% range, slightly lower than the corresponding differences on the small data.

Our hypothesis is that much of the improvement is due to the additive representations being especially beneficial for modelling rare words. We test this by repeating the experiment under the condition where all word types occurring only once are excluded from the vocabulary ($\kappa$ = 1). If the additive representations were not beneficial to rare words, the outcome should remain the same. Instead, we find the relative improvements become a lot smaller (Table 4.6, bottom) than when only excluding some singletons ($\kappa$ = 0.05), which supports that hypothesis.

                      CLBL               CLBL++
               MKN    abs    rel1        abs    rel2
κ=0.05   Cs    862    683    -20.8%      643    -5.9%
         De    463    422    -8.9%       404    -4.2%
         En    291    281    -3.4%       273    -2.8%
         Es    219    207    -5.7%       203    -1.9%
         Fr    243    232    -4.9%       227    -1.9%
         Ru    390    313    -19.7%      300    -4.2%
         avg   411    356    -10.6%      342    -3.5%
κ=1.0    Cs    634    477    -24.8%      462    -3.1%
         De    379    331    -12.6%      329    -0.9%
         En    254    234    -7.6%       233    -0.7%
         Es    195    180    -7.7%       180    0.02%
         Fr    218    201    -7.7%       198    -1.3%
         Ru    347    271    -21.8%      262    -3.4%
         avg   338    282    -13.7%      277    -1.6%

Table 4.6: Test-set perplexities for Data-Main using two vocabulary pruning settings. Percentage reductions are relative to the preceding model, e.g. the first Czech CLBL improves over MKN by 20.8% (rel1); the CLBL++ improves over that CLBL by a further 5.9% (rel2).

Model perplexity on a whole dataset is a convenient summary of its intrinsic performance, but such a global view does not give much insight into how one model outperforms another. To focus more directly on the source of improvements, we partition the test data into subsets of interest and measure perplexity over these subsets.

4.4.4.1 Analysis by frequency 

Figure 4.4: Perplexity reductions by token frequency, CLBL++ relative to CLBL. Dotted bars extending further down are better. A bin labelled with a number $x$ contains those test tokens that occur $y \in [10^x, 10^{x+1})$ times in the training data. Striped bars give percentage test-set coverage for each bin. Panels: Cs, Ru, De, Es.

We first partition on token frequency, as computed on the training data. Figure 4.4 provides further evidence that the additive models have most impact on rare words generally, and not only on singletons. Czech, German and Russian see relative perplexity reductions of 8%-21% for words occurring fewer than 100 times in the training data. Reductions become negligible for the high-frequency tokens. These tend to be punctuation and closed-class words, where any putative relevance of morphology is overwhelmed by the fact that the predictive uncertainty is very low to begin with (absolute PPL<10 for the highest frequency subset). For the morphologically simpler Spanish case, perplexity reductions are generally smaller across frequency scales.
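The frequency-binned analysis can be sketched as below. It is an illustration under assumptions rather than the evaluation script used here: per-token natural-log probabilities are taken as given, the bin index follows the order-of-magnitude definition from the Figure 4.4 caption, and the function names and the separate "unseen" bin for zero-count tokens are hypothetical.

```python
import math
from collections import defaultdict

def frequency_bin(train_count):
    """Bin x such that the training count lies in [10^x, 10^(x+1))."""
    if train_count == 0:
        return "unseen"  # hypothetical label; zero counts fit no power-of-ten bin
    return len(str(train_count)) - 1  # exact integer order of magnitude

def perplexity_by_bin(test_tokens, log_probs, train_counts):
    """Perplexity of the test tokens falling in each training-frequency bin."""
    total = defaultdict(float)
    n = defaultdict(int)
    for word, lp in zip(test_tokens, log_probs):
        b = frequency_bin(train_counts.get(word, 0))
        total[b] += lp
        n[b] += 1
    return {b: math.exp(-total[b] / n[b]) for b in total}

# Toy data: made-up counts and model probabilities.
train_counts = {"the": 1500, "cat": 12}
tokens = ["the", "cat", "zyx"]
log_probs = [math.log(0.5), math.log(0.1), math.log(0.01)]
ppl = perplexity_by_bin(tokens, log_probs, train_counts)
```

Running this for two models on the same test tokens and comparing the per-bin perplexities would give relative reductions of the kind plotted in Figure 4.4.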


4.4.4.2 Analysis by part of speech 


We also break down perplexity reductions by part of speech tags, focusing on German. We used the decision tree-based tagger of Schmid and Laws (2008), which reportedly has a tagging accuracy of 91%. Aside from unseen tokens, the biggest improvements are on nouns and adjectives (Figure 4.5), suggesting our segmentation-based representations help abstract over German’s productive compounding.

German noun phrases require agreement in gender, case and number, which are marked overtly with fusional morphemes, and we see large gains on such test n-grams: 15% improvement on adjective-noun sequences, and 21% when considering the more specific case of adjective-adjective-noun sequences. An example of the latter kind is “der ehemalig-e sozial-ist-isch-e bildung-s-minister” (the former socialist minister of education), where the morphological agreement surfaces in the repeated e-suffix.

Figure 4.5: Perplexity reductions by part of speech, CLBL++ relative to CLBL on German. Dotted bars extending further down are better. Tokens tagged as foreign words or other opaque symbols resort under “Rest”. Striped bars are as in Figure 4.4.

[Figure 4.6 plot omitted; legend: CLBL++ vs. MKN and CLBL++ vs. CLBL, each for |V|=206k and for |V_var|.]

Figure 4.6: Relative perplexity reductions obtained when varying the Czech training data size (16m-128m). In the first batch, the vocabulary was held fixed to the set $V$ as obtained for Data-Main (squares & circles). In the second batch, the vocabulary $V_{\text{var}}$ varied according to the data set of the given size (diamonds & triangles).
4.4.4.3 Scaling


We conducted a final scaling experiment on Czech by training models on increasing amounts of data from the monolingual news corpora. Improvements over the MKN baseline decrease, but remain substantial at 14% for the largest setting when allowing the vocabulary to grow with the data. Maintaining a constant advantage over MKN requires also increasing the dimensionality $d$ of representations (Mikolov et al., 2013a), but this was outside the scope of our experiment. Although gains from the additive representations over the CLBL diminish down to 2%-3% at the scale of 128m training tokens (Table 4.6), these results demonstrate the tractability of our approach on large vocabularies of nearly 1m types.


4.4.5 Application to word similarity rating 

In the previous section, we established the positive role that morphological awareness played in building DLMs that better predict unseen text. Here we focus on the quality of the word representations learnt in the process. We evaluate on a standard word similarity rating task, where one measures the correlation between cosine-similarity scores for pairs of word vectors and a set of human similarity ratings. An important aspect of our evaluation is to measure performance on multiple languages using a single unsupervised, model-based approach.

Morpheme vectors from the CLBL++ enable handling OOV test words in a more nuanced way than using the global unknown word vector. In general, we compose a vector $\mathbf{u}_v = [\mathbf{q}_v; \mathbf{r}_v]$ for a word $v$ according to a post hoc word map $\mu'$ by summing and concatenating the factor vectors $\mathbf{r}_f$ and $\mathbf{q}_f$, where $f \in \mu'(v) \cap \mathcal{F}$. This ignores unknown morphemes occurring in OOV words, and uses $[\mathbf{q}_{\text{UNK}}; \mathbf{r}_{\text{UNK}}]$ for $\mathbf{u}_{\text{UNK}}$ only if all morphemes are unknown.¹³

13 Concatenation of context and target vectors performed better overall than using either type individually, suggesting the two representation spaces capture complementary information.

To see whether the morphological representations improve the quality of vectors for known words, we also report the correlations obtained when using the CLBL++ word vectors directly, resorting to $\mathbf{u}_{\text{UNK}}$ for all OOV words $v \notin V$ (denoted “-compose” in the results). This is also the strategy that the baseline CLBL model is forced to follow for OOVs.
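The composition rule and the similarity score can be sketched as follows. This is a toy illustration rather than the evaluation code: the dictionaries `Q` and `R` stand in for the context and target factor vectors, the `<unk>` key and the example segmentation are hypothetical, and a factor is treated as known only if it has both kinds of vector.

```python
import math

def compose(word, post_hoc_map, Q, R, unk="<unk>"):
    """Compose [q; r] for a word by summing its known factor vectors.

    Unknown factors are ignored; the Unk vectors are used only when none
    of the word's factors is known.
    """
    known = [f for f in post_hoc_map.get(word, [word]) if f in Q and f in R]
    if not known:
        return Q[unk] + R[unk]  # list concatenation
    q = [sum(Q[f][i] for f in known) for i in range(len(Q[unk]))]
    r = [sum(R[f][i] for f in known) for i in range(len(R[unk]))]
    return q + r

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

# Toy factor vectors, made up for illustration.
Q = {"<unk>": [0.0, 1.0], "un_PRE": [1.0, 0.0], "kind_STM": [0.0, 2.0]}
R = {"<unk>": [1.0, 1.0], "un_PRE": [0.5, 0.5], "kind_STM": [1.0, 0.0]}
post_hoc = {
    "unkindly": ["unkindly", "un_PRE", "kind_STM", "ly_SUF"],  # OOV surface form
    "zzz": ["zzz", "zz_STM"],                                  # nothing known
}
u_oov = compose("unkindly", post_hoc, Q, R)
u_unk = compose("zzz", post_hoc, Q, R)
similarity = cosine(u_oov, u_unk)
```

Here the OOV word "unkindly" still receives an informative vector from its known prefix and stem, whereas the "-compose" setting would map it directly to the Unk vector.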


We evaluate first using the English rare-word dataset (Rw) created by Luong et al. (2013). Its 2034 word pairs contain more morphological complexity than other well-established word similarity datasets, e.g. crudeness-impoliteness. We compare against their context-sensitive morphological recursive neural network (csmRNN), using Spearman’s rank correlation coefficient, $\rho$. Table 4.7 shows our model obtaining a $\rho$ value between the two csmRNN results, depending on which baseline word embeddings they used to initialise the csmRNN.
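For reference, Spearman's $\rho$ is the Pearson correlation between the ranks of the two score lists. The sketch below is a generic standard-library implementation with average ranks for ties, not the evaluation script used for the tables; the function names and the example scores are hypothetical.

```python
import math

def average_ranks(xs):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # correlation is undefined for constant input; return 0 by convention
    return cov / math.sqrt(var_a * var_b)

def spearman(xs, ys):
    return pearson(average_ranks(xs), average_ranks(ys))

# Model cosine scores versus human ratings for four word pairs (made-up values).
model_scores = [0.10, 0.35, 0.40, 0.90]
human_ratings = [1.0, 4.0, 9.0, 100.0]
rho = spearman(model_scores, human_ratings)
```

Because only ranks matter, any monotone rescaling of the model scores leaves $\rho$ unchanged, which is why cosine similarities can be compared directly against human ratings on a different scale.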

This is a strong result given that our vectors come from a simple linear probabilistic model that is also suitable for integration directly into a decoder for translation (§4.4.7) or speech recognition, which is not the case for csmRNNs. Moreover, the csmRNNs were initialised with high-quality, publicly available word embeddings trained over weeks on much larger corpora of 630m-990m words (Collobert and Weston, 2008; Huang et al., 2012), in contrast to ours that are trained from scratch on much less data. This renders our method directly applicable to languages which may not yet have those resources.

A minor difference in configuration is that our vectors are 100-dimensional as opposed to the csmRNNs’ 50-dimensional ones. Using embeddings¹⁴ from the hierarchical LBL (Mnih and Hinton, 2008), we verified that the doubling in dimensionality increased $\rho$ by less than 0.02 absolute, so this difference should be negligible in our evaluation.

Relative to the CLBL baseline, our method performs well on datasets across four languages. For the English Rw, which was designed with morphology in mind, the gain is 64%. But also on the standard English WS353 dataset (Finkelstein et al., 2002), we get a 26% better correlation with the human ratings. On German, the CLBL++ obtains correlations up to three times stronger than the baseline, and 39% better for French (Table 4.8).

14 Source: http://metaoptimize.com/projects/wordreprs
(Luong et al., 2013)         Our models
HSMN             2           CLBL          18
HSMN+csmRNN      22          CLBL++        30
C&W              27          -compose      20
C&W+csmRNN       34

Table 4.7: Word-pair similarity task (English), showing Spearman’s $\rho \times 100$ for the correlation between model scores and human ratings on the English Rw dataset. The csmRNNs benefit from initialisation with high quality pre-existing word embeddings, while our models used random initialisation. HSMN refers to the embeddings from (Huang et al., 2012); C&W to those from (Collobert and Weston, 2008).


Datasets¹⁵     WS            Gur     RG            ZG
               En     Es     De      En     Fr     De
HSMN           63     -      -       63     -      -
+csmRNN        65     -      -       65     -      -
CLBL           32     26     36      47     33     6
CLBL++         39     28     56      41     45     25
-compose       40     27     44      41     41     23
# pairs        353           350     65            222

Table 4.8: Word-pair similarity task in multiple languages, showing Spearman’s $\rho \times 100$ and the number of word pairs in each data set. As benchmarks, we include the best results from Luong et al. (2013), who relied on more data and pre-existing embeddings not available in all languages. In the penultimate row our model’s ability to compose vectors for OOV words is suppressed.


4.4.6 Qualitative analysis of morpheme vectors 

A visualisation of the English morpheme vectors (Figure 4.7) suggests the model captured non-trivial morphological regularities: noun suffixes relating to persons (writer, humanists) lie close together, while being separated according to number; negation prefixes share a region (un-, in-, mis-, dis-); and relational prefixes are grouped (supra-, super-, multi-, intra-), with a potential explanation for their separation from inter- being that the latter is more strongly bound up in lexicalisations (international, intersection). Some clusters abstract over spelling variation (-ise vs. -ize), while others share similarity in terms of the stems they would attach to, e.g. global(-ize,-izing,-ization).

15 Es WS353 (Hassan and Mihalcea, 2009); Gur350 (Gurevych, 2005); RG65 (Rubenstein and Goodenough, 1965) with Fr (Joubarne and Inkpen, 2011); ZG222 (Zesch and Gurevych, 2006).

Figure 4.7: Visualisation of the morpheme vectors learnt by CLBL++ on English data, using t-SNE (van der Maaten and Hinton, 2008) for dimensionality reduction. Shading added for emphasis.

4.4.7 Application to machine translation 

The final aspect of our evaluation focuses on the integration of class-decomposed log-bilinear models into a machine translation system. To the best of our knowledge, this is the first study to investigate large vocabulary normalised DLMs inside a decoder when translating into a range of morphologically rich languages. We consider 5 language pairs, translating from English into Czech, German, Russian, Spanish and French.

Aside from the choice of language pairs, this evaluation diverges from Vaswani et al. (2013) by using normalised probabilities, a process made tractable by the class-based decomposition and caching of context-specific normaliser terms. Vaswani et al. (2013) relied on unnormalised model scores for efficiency, but do not report on the performance impact of this assumption. It is not clear how reliable an unnormalised language model score is as a feature that must help the translation model discriminate between alternative hypotheses, as frequency-effects would give rise to scores of very different scales.


We use cdec (Dyer et al., 2010, 2013) to build symmetric word-alignments and extract rules for hierarchical phrase-based translation (Chiang, 2007). Our baseline system uses a standard set of features in a log-linear translation model. This includes a baseline 4-gram MKN language model, trained with SRILM (Stolcke, 2002) and queried efficiently using KenLM (Heafield, 2011). The DLMs are integrated directly into the decoder as an additional feature function, thus exercising a stronger influence on the search than in n-best list rescoring. Translation model feature weights are tuned with MERT (Och, 2003) on newstest2012.

          MKN            CLBL           CLBL++
En→Cs     12.6 (0.2)     13.2 (0.1)     13.6 (0.04)
   De     15.7 (0.1)     15.9 (0.2)     15.8 (0.4)
   Es     24.7 (0.4)     25.5 (0.5)     25.7 (0.3)
   Fr     24.1 (0.2)     24.6 (0.2)     24.8 (0.5)
   Ru     15.9 (0.2)     16.9 (0.3)     17.1 (0.1)
Cs→En     19.8 (0.4)     20.4 (0.4)     20.4 (0.5)

Table 4.9: Bleu scores (case-insensitive) obtained on newstest2013, with standard deviation over 3 MERT runs given in parentheses. The two right-most columns use the listed DLM as a feature in addition to the MKN feature, i.e. these MT systems have at most 2 LMs. Language models are from Table 4.6 (top).

Table 4.9 summarises our translation results. Inclusion of the CLBL++ language model feature outperforms the MKN-only baseline systems by 1.2 Bleu points for translation into Russian, and by 1 point into Czech and Spanish. The En→De system benefits least from the additional DLM feature, despite the perplexity reductions achieved in the intrinsic evaluation. In light of German’s productive compounding, it is conceivable that the bilingual coverage of that system is more of a limitation than the performance of the language models.

On the other languages, the CLBL adds 0.5-1 Bleu points over the baseline, whereas additional improvement from the additive representations lies within MERT variance except for En→Cs.
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4.5 Related work 


Factored language models (FLMs) have been used to integrate morphological in¬ 
formation into both discrete n-gram LMs (Bilmes and Kirchhoff] 2003) and DLMs 


(Alexandrescu and Kirchhoff 2006) by viewing a word as a set of factors. [Alexan- 


drescu and Kirchhoff (2006) demonstrated how factorising the representations of 


context-words can help deal with out-of-vocabulary words, but they did not evalu¬ 
ate the effect of factorising output words and did not conduct an extrinsic evalua¬ 
tion. The FLM notion has since been used to integrate part-of-speech information 


into recurrent neural-network language models for ASR (Wu et al.,[2012) and code¬ 
switching ( |Adel et afj|2013[ ). 

A variety of strategies have been explored for bringing DLMs to bear on ma¬ 
chine translation. Rescoring lattices with a DLM proved to be beneficial for ASR 


(Schwenk, 2004) and was subsequently applied to translation (Schwenk et al. 2006 


Schwenk and Koehn] 2008|), reaching training sizes of up to 500m words (Schwenk 


et al. 2012). For efficiency, this line of work relied heavily on small “shortlists” 


of common words, by-passing the DLM and using a back-off n-gram model for the 
remainder of the vocabulary. Using unnormalised DLMs during first-pass decoding 


has generated improvements in Bleu score for translation into English (Yaswani 
etakl [20T3] ). 


More indirect approaches have approximated DLMs by sampling text from them 
and then training conventional n-gram models on that (Allauzen et al.] 2013]), or by 


casting the DLM directly as a back-off n-gram model through pruning (Wang et al., 


2013), while Duh et al. (2013) used them for selecting relevant in-domain training 


data. 


Vector representations have also been studied extensively within the area of
distributional semantics (Turney and Pantel, 2010). There, vectors are typically
constructed directly from the occurrence counts of words in a corpus, with each
dimension of the vector space designating a specific context, for example. A variety
of techniques have been considered for how to compose word vectors into
representations of phrases or sentences (Mitchell and Lapata, 2010; Baroni and
Zamparelli, 2010, inter alia), which have been extended to address the question of
composing morpheme vectors into words by Lazaridou et al. (2013).

Other recent work has moved beyond monolingual vector-space modelling,
incorporating phrase similarity ratings based on bilingual word embeddings as a
translation model feature (Zou et al., 2013), or formulating translation purely in
terms of distributed models (Kalchbrenner and Blunsom, 2013).


Accounting for linguistically derived information such as morphology (Luong
et al., 2013; Lazaridou et al., 2013) or syntax (Hermann and Blunsom, 2013) has
proved beneficial to learning vector representations of words. Our contribution is to
create morphological awareness in a probabilistic language model.


Parameter tying within an LBL has also been used in domain adaptation (Xiao
and Guo, 2013), where the representation space is subdivided into source, target and
background domains.


4.6 Summary 

We introduced a method for integrating morphology into probabilistic distributed
language models. Our method has the flexibility to be used for morphologically rich
languages (MRLs) across a range of linguistic typologies. Our empirical evaluation
focused on multiple MRLs and different tasks. The primary outcomes are that
(i) our morphology-guided DLMs improve intrinsic language model performance
when compared to baseline DLMs and n-gram MKN models; (ii) word and morpheme
representations learnt in the process compare favourably in terms of a word
similarity task to a recent more complex model that used more data, while obtaining
large gains on some languages; (iii) machine translation quality as measured
by Bleu was improved consistently across six language pairs when using DLMs
during decoding, although the morphology-based representations led to further
improvements beyond the level of optimiser variance only for English→Czech. By
demonstrating that the class decomposition enables full integration of a normalised
DLM into a decoder, we open up many other possibilities in this active modelling
space.
Chapter 5

Unsupervised learning of complex 
morphology 


Chapter Abstract 


The previous two chapters considered methods for integrating morphological
information into two different types of LMs on the assumption
that an external source of information is available. In this chapter, we
go beyond sequence-based language modelling to consider how morphological
information can be obtained in the first place using unsupervised
methods. We introduce a novel approach to modelling concatenative
and non-concatenative morphology jointly. This chapter is
an extension of material originally published as (Botha and Blunsom,
2013).


Past work on unsupervised morphology learning has overwhelmingly considered
concatenative morphology, where surface words can be analysed into a sequence of
continuous morphemes or string segments. Consecutive instances of the
Morpho-Challenge shared-task boosted contributions to unsupervised learning in this
regard (Kurimo et al., 2010). But another important aspect of unsupervised learning of
surface-level morphology concerns non-concatenative processes, where the morphemes
of a word do not necessarily coincide with contiguous strings but are instead
dispersed throughout the word. This type of morphology has received much less
attention within the realm of unsupervised learning: the survey by Hammarström and
Borin (2011) identifies only 10 of the 200+ studies within its purview as targeting
non-concatenative morphology.

Concatenative and non-concatenative morphological processes do not function
in isolation. The classical example is Arabic, which uses non-concatenative
processes to derive noun and verb stems, onto which various clitics and affixes attach
in a concatenative fashion. As another example, Tagalog makes extensive use of
infixing and circumfixing in its conjugation of verbs.¹ In both these
examples, a segmentational approach recovers some of the building blocks of words,
but naturally misses out on certain regularities, e.g. the notion that the Arabic stems
katabat (‘she wrote’) and >akotubu (‘I write’) are bound together by the shared,
discontiguous root ktb; or that two Tagalog words sumulat and sinulatan (different
conjugations of ‘wrote’) share the same root, sulat. An unsupervised model that
does not have the expressivity to capture these kinds of relations therefore cannot
benefit from the strong signals they encode. Modelling non-concatenative morphology
is thus a way of extracting more value from the unannotated text by reducing
sparsity.

The primary contribution of this chapter is the formulation of a probabilistic
model that jointly captures concatenative and non-concatenative morphology, as
exemplified above. We extend adaptor grammars (Johnson et al., 2007a) to a mildly
context-sensitive grammar formalism that easily captures both kinds of surface-level
morphology. We argue that the general approach is suitable to a variety of settings,
but focus the application on Semitic morphology.

The structure of the chapter is as follows: The next section provides more
background on Semitic morphology. The subsequent section reviews existing
computational treatments of Semitic morphology and explains the relative merits of an
unsupervised approach. The modelling work follows in three parts: First, we cover
background material for unsupervised learning with adaptor grammars. Secondly,
we present the more expressive grammar formalism and demonstrate its application
to Semitic morphology. Finally, we report empirical evaluation results on Arabic
and Hebrew.

1 Tagalog is spoken in the Philippines.


5.1 Basic overview of Semitic morphology 


The Semitic language family is a suitable test-bed for the modelling approach we
propose in light of the fundamentally non-concatenative root-templatic
morphology.² Member languages that are widely spoken today include Arabic, Hebrew,
Amharic and Tigrinya.

In contrast to many languages where the word stem is the most elementary
semantic building block, Semitic stems are derived from semantic root morphemes
and grammatical templates which combine in a non-concatenative way.³

The roots consist of between two and five radicals which are exclusively
consonants. Roots are often referred to by the number of radicals they have. Triliteral
roots featuring three radicals are the dominant kind. Because of the dominance of
triliteral roots, existing modelling approaches are often limited to this class. An aim
of the model we introduce is that it should easily encode roots of different lengths.

Templates encode grammatical information, such as number or person. To a first 
approximation, verb and noun stems are derived by interspersing root radicals into 
templates. In Arabic, for example, kitAb (‘book’), kutub (‘books’), katab (‘write’) 
and takAtab (‘correspond’) all derive from the root ktb. The root cannot in general 
be identified as a continuous substring of the stem. 
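The last point can be checked mechanically. The following is a minimal sketch, not part of the model itself: it tests whether the root ktb occurs in each example stem as an in-order subsequence, and whether it occurs as a contiguous substring. The function name `is_subsequence` is a hypothetical helper introduced only for this illustration.

```python
def is_subsequence(root: str, stem: str) -> bool:
    """True if the characters of `root` occur in `stem` in order, gaps allowed."""
    remaining = iter(stem)
    # `radical in remaining` advances the iterator until the radical is found,
    # so later radicals can only match later positions in the stem.
    return all(radical in remaining for radical in root)


ROOT = "ktb"
STEMS = ["kitAb", "kutub", "katab", "takAtab"]  # transliterated stems from the text

for stem in STEMS:
    # Columns: stem, root found as subsequence, root found as contiguous substring
    print(stem, is_subsequence(ROOT, stem), ROOT in stem)
```

For all four stems the root is recoverable only as a discontiguous subsequence, which is the property a purely segmentational model cannot express.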

Such stems undergo further transformation to give rise to surface word forms.
Various inflectional affixes and clitics attach to stems by concatenation to give rise
to multi-segment surface words. A clitic embodies the syntactic properties of an
independent word but nonetheless surfaces as a bound morpheme. Pronouns,
prepositions, conjunctions and the definite article are all cliticised in Arabic, which means
segmentation is challenging also when one ignores the non-concatenative element
of the morphology. Indeed, one hypothesis we investigate is that an account of the
underlying root has a disambiguating effect on segmentation (§5.8).

2 Many different linguistic terms apply to this discussion. For simplicity’s sake we do not dwell on
their subtle distinctions and use whichever term is most convenient and sufficiently precise in a given
context. “Root-templatic” morphology is also referred to as root-and-pattern morphology. The
particular sense in which it is non-concatenative is also variously denoted as transfixation, intercalation
and interdigitisation.

3 Linguistic morphology distinguishes between the consonant-vowel skeleton (e.g. CVCVC) and
the vocalism (e.g. i... a) (McCarthy, 1981). We use “templates” to refer to the combination.

The following list shows typical examples of the aforementioned morphological
processes⁴ for surface words arising from stems based on the root ktb:

• verb person: katab (‘write’) ⇒ kataba (‘he wrote’), katabat (‘she wrote’),
katabona (‘we wrote’)

• pronominal enclitics: katabatohu (‘she wrote it’)

• tense proclitic: >akotubu (‘I write’) ⇒ sa>akotubu (‘I will write’)

• determiner: AlkitAbi (‘the book’)

• prepositional proclitics: biAlokitAbi (‘for the book’)

• conjunction: wabiAlokitAbi (‘and for the book’)


5.1.1 Orthographic variety 


Another central property of Semitic languages is that their orthographic systems are
so-called abjads, which means the explicit writing of vowels is partly or completely
optional. The examples given so far use the less common vocalised form, as it
illustrates the morphological processes better. In standard orthography, a system of
diacritic marks is used to indicate that vocalisation, especially in literary and
religious texts. But the more dominant case is unvocalised text, which suppresses some
vocalisation information. This can render individual words highly ambiguous so
that their correct interpretation relies on sentential context. For example, the
unvocalised word token ktb has at least 15 readings.⁵ As the focus in this chapter is on
type-based morphological analysis it does not include solving this disambiguation
problem.⁶ However, the relevance of the orthographic variation is that it complicates
computational approaches to dealing with Semitic text. A key test for an
unsupervised model is whether it can effectively handle such variation without requiring
manual modification. Aside from a usability standpoint, the advantage of such a
model is that it could learn directly from text that is heterogeneous in its
vocalisation.

4 See (Habash, 2010) for a more complete introduction to Arabic morphology.

5 This example was obtained using the Xerox Arabic Morphological Analyzer, available at
https://open.xerox.com/Services/arabic-morphology

5.1.2 Complex non-concatenative processes outside scope 

The preceding characterisation of root-templatic derivation is a simplification. It
excludes cases where the correspondence between the characters in the surface word
and those in the constituent morphemes is not one-to-one. Such cases occur where
deletion and insertion of characters happens during word formation. The first
instance of this is weak roots, where one or more of the radicals do not appear in the
derived stem. The second instance concerns gemination, where there is a lengthening
of a consonant, e.g. the stem kuttAb (‘writers’) as derived from root ktb.

These phenomena are present in the data used for our empirical evaluation though
not accounted for by the model we introduce. The reason for limiting the scope this
way is that the probabilistic modelling would otherwise be significantly more
complex. Instead, we focus on giving a unified probabilistic account of concatenative
and non-concatenative processes purely at the level of surface strings.


5.2 Existing approaches 


This section surveys the dominant existing computational treatments of Semitic
morphology. The aim is to situate our unsupervised probabilistic approach among
alternative approaches while considering their relative merits.

6 For studies that make direct use of sentential context to address this ambiguity problem, see
(Habash and Rambow, 2005; Poon et al., 2009; Lee et al., 2011).
5.2.1 Rule-based methods


There is an extensive literature on encoding linguistic rules explicitly in formal
computational frameworks to perform automatic morphological analysis. The dominant
strand of work uses finite-state transducers (FSTs). FSTs are naturally suited to
concatenative morphology, given that concatenation is one of the fundamental
operations in regular languages. To deal with non-concatenative phenomena, FSTs
require careful augmentation (Beesley and Karttunen, 2003). Early efforts on both
Arabic (Kataja and Koskenniemi, 1988) and Hebrew (Lavie et al., 1988) made
extensions to the foundational two-level morphological framework (Koskenniemi, 1984)
to cater for root-templatic morphology. More complete approaches have directly
operationalised the influential theory of non-concatenative morphology proposed by
McCarthy (1981), using multi-tape transducers that are able to process the root,
template and vocalism of the Semitic stem separately (Kay, 1987; Kiraz, 2000).
The MAGEAD system extends this approach further and its creators argue that the
non-concatenative handling of stems is essential to deal efficiently with the different
Arabic dialects (Habash et al., 2005; Habash and Rambow, 2006). Other
augmentations make use of registers to give FSTs some limited memory to handle
short-term dependencies (Cohen-Sygal and Wintner, 2006), whereas Gasser (2009) used
weighted FSTs with a unification operation over attribute-value pairs.

The FST-based approaches are broadly lexical in their view of inflection and
derivation: the morphosyntactic properties of words are carried by lexical
morphemes. A separate strand of work in Semitic computational morphology is based
instead on inferential-realisational theories of morphology (Stump, 2001), where
morphosyntactic properties are specified purely by rules. Finkel and Stump (2002)
use this to formulate an account of Hebrew verb morphology using hierarchies of
inherited rules. Smrž (2007) developed a complete account of Arabic morphology
that falls within this theoretical class. His ElixirFM system makes use of a functional
morphology framework (Forsberg and Ranta, 2004) to formulate a domain-specific
language embedded in a functional programming language.
The aforementioned approaches in both strands of work share the property of
encoding morphological analysis and generation. A more ad-hoc approach that is
limited to analysis is that of the Buckwalter Arabic Morphological Analyser (BAMA;
Buckwalter, 2002). BAMA relies on lists of stems and affixes together with
consistency constraints on the co-occurrence of different morphemes within a word. It
ignores the non-concatenative derivation of stems, but its lexica, which also indicate
root morphemes, have been influential in the development or evaluation of other
methods (Darwish, 2002; Smrž, 2007; Rodrigues and Ćavar, 2007, inter alia).

These rule-based approaches typically perform exceptionally well at their task,
given the level of targeted design involved. But the limitations are substantial: A
significant amount of manual effort and expert linguistic knowledge is required to
develop the methods described in this section. This includes having a precise
linguistic framework to operationalise, or a sufficient amount of lexical information as
in the case of BAMA. Consequently, transferring a given method or system to other
dialects or languages within the Semitic family can be a big undertaking. Finally,
even within their target language, the rule-based approaches that have a lexical
component invariably offer limited coverage and require manual addition for application
to new domains.

5.2.2 Supervised learning 

Supervised learning of morphology allows some of the limitations of a rule-based 
approach to be overcome, and shifts the resource requirement away from linguistic 
expertise toward data. Common sources of supervision include pairs of inflected 
words, words labelled with their roots, and lexica of affixes. A learning component 
then has to infer parametrisations or parameter values (or both) from these labelled 
resources. The existing literature on Semitic morphology that falls into this category 
is limited and fragmented. 

One relevant strand of work considers the simpler problem of predicting the
Arabic broken plural (e.g. kutub) given a base form (e.g. katab). Clark (2001b, 2002)
applies pair hidden Markov models (HMMs), which have two emission streams, to
train stochastic FSTs from such word-pair data. The approach is well-motivated
for non-concatenative phenomena such as Semitic stem derivation, but it does not
perform well in evaluation.

A more common theme has been to leverage the aforementioned sources of
supervision to construct methods for identifying the root morpheme of a given word.
Darwish (2002) combines pre-existing Arabic resources with various heuristics and
rudimentary statistical models to devise an analyser that outputs a ranked list of
candidate roots for a word.

Daya et al. (2008) approached root identification from the perspective of
character-based classification. They train off-the-shelf classifiers for each radical and consider
various ways of combining their predictions. Their investigation demonstrates the
feasibility of the supervised approach by obtaining F-scores in the eighties, while
transferring it from Hebrew to Arabic required relatively little effort. The methods
are however specific to triliteral roots and do not contain a segmentational
component.

Boudlal et al. (2009) employed a first-order HMM with roots as hidden states
to identify the root of a given word with an accuracy of 94%; they relied on greedy
methods and dictionaries of templates, roots and affixes to limit the hidden state
space.

5.2.3 Unsupervised learning 

As stated in the introduction of this chapter, the existing work on unsupervised
learning of non-concatenative morphology is much more limited than that on
concatenative morphology, and this extends to Semitic morphology. We take “unsupervised”
to cover situations where instances of the expected output of a method are not
supplied during its training.

Probabilistic generative models constitute an attractive approach in this category
as they allow high-level intuitions to be encoded in a coherent fashion. The only
work we are aware of that directly targets root-templatic morphology in this sense
is the non-parametric Bayesian model of Fullwood and O’Donnell (2013). Their
model generates a word from a root, template, and “residue”, which captures the
non-root characters of the word. For example, kitAbi would arise through the
combination of root(ktb), template(r-r-r-) and residue(iAi). Each of these categories is
modelled by an independent PYP-distributed lexicon, where the base distributions
support arbitrary length sequences. Their evaluation is limited to template
identification accuracy, where it performs very well. Two main shortcomings of the model
are the lack of affix modelling and the atomicity of the lexicon elements:


1. The model only targets stem-formation. Given a word and its inferred
template and residue, one could in principle extract a segmentation into prefix,
stem and suffix, but segmentation of multiple adjacent affixes is nonetheless
precluded.

2. Lexicon elements are atomic in the sense that their characters are sampled 
independently from a uniform distribution over the alphabet. This means there 
is no aggregation of probability mass around the recurrence of, say, consonants 
in the root lexicon—the PYP only caches the full strings. 
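To make the root/template/residue decomposition concrete, here is a minimal sketch of the deterministic combination step only, under the assumption that each `r` in the template is a slot for the next root character and each `-` a slot for the next residue character (this reading is inferred from the kitAbi example above). The function name `combine` is hypothetical, and the sketch says nothing about the PYP lexicons or inference in the original model.

```python
def combine(root: str, template: str, residue: str) -> str:
    """Fill 'r' slots from the root and '-' slots from the residue, left to right."""
    if set(template) - {"r", "-"}:
        raise ValueError("template may only contain 'r' and '-' slots")
    if template.count("r") != len(root) or template.count("-") != len(residue):
        raise ValueError("template slots do not match root/residue lengths")
    radicals, fillers = iter(root), iter(residue)
    return "".join(next(radicals) if slot == "r" else next(fillers)
                   for slot in template)


# The example from the text: root(ktb), template(r-r-r-), residue(iAi).
print(combine("ktb", "r-r-r-", "iAi"))
```

Because the residue is a single undifferentiated string, this representation offers no handle for splitting the final character off as a suffix distinct from the stem-internal vowels, which relates to the first shortcoming listed above.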


The approach we introduce in §5.7 was developed contemporaneously with
Fullwood and O’Donnell (2013), but specifically overcomes these two shortcomings,
while aiming to be more easily extensible.

Other probabilistic approaches targeting Semitic segmentation have exploited
regularities beyond word types, such as using sentential context (Poon et al., 2009;
Lee et al., 2011) or multi-lingual phrases (Snyder and Barzilay, 2008).


There are unsupervised methods that indirectly address root-templatic
morphology without necessarily attempting to model the word formation process. Words
sharing the same root can be clustered based on heuristic scoring of character
tuples (de Roeck and Al-Fares, 2000), which is useful for information retrieval.
Baroni et al. (2002) similarly acquire morphologically related word pairs (e.g. Verkauf,
Verkäufe) from raw text using string-edit distances and distributional cues.⁷ Their
approach is in principle applicable to Semitic morphology since the string-edit
operations suit root-templatic morphology, but they frame the method as a potentially
useful precursor to morphological analysis and did not apply it to Semitic languages.

7 This is a variation of the knowledge-free method by Schone and Jurafsky (2000), which was
affix-based and thus limited to concatenative morphology.


5.3 Unsupervised learning with CFGs 


Our overall approach to the problem of learning morphology from unannotated text
is to construct a model based on rewrite rules. This type of approach has been
applied successfully to morphological segmentation (Johnson et al., 2007a,b; Johnson,
2008) and offers well-defined mechanisms for encoding prior intuitions about the
morphology.

The following two sections develop the relevant background material for
learning with context-free grammars, which we subsequently build on for learning
non-concatenative phenomena.


5.3.1 Probabilistic context-free grammars 

A context-free grammar $G_{\text{CFG}} = (S, N, T, P)$ is specified by a set of non-terminal
symbols $N$, terminal symbols $T$ and rewrite rules $P$ of the form $A \to \gamma$, where
$A \in N$ and $\gamma \in (N \cup T)^{+}$. $S \in N$ is the start symbol. A string $w \in T^{+}$ in the
implied context-free language is derived from the start symbol $S$ through a sequence of
rule applications $r_1, \ldots, r_k \in P$ that expand non-terminal symbols until no further
expansions are possible.

This symbolic grammar can be cast as a generative probabilistic model, known as
a probabilistic context-free grammar (PCFG), by defining a probability distribution
over each subset $P_A \subseteq P$ that contains the rules rewriting a symbol $A \in N$. That
is, each rule $r \in P_A$ has an associated real value $\theta_r$ such that $0 \le \theta_r \le 1$ and
$\sum_{r \in P_A} \theta_r = 1$. The stochastic procedure for generating a string from the model
follows the same derivation pattern outlined above, but samples the rule $r$ to use for
expanding a given non-terminal $A$ from a categorical distribution parametrised by
$\theta$. The key point is that this assumes conditional independence among expansions
given a fixed $G_{\text{CFG}}$, so that the joint probability of a string $w$ and a derivation
$R = r_1, \ldots, r_k$ is $P(w, R \mid \theta) = \prod_{r \in R} \theta_r$.


Two learning problems thus become apparent: what are the rules in $P$, and
what are their probabilities $\theta$? The former problem is challenging to pursue in a
completely unsupervised fashion (cf. Stolcke and Omohundro, 1994; Clark, 2001a).
If one assumes a set of rules is given, the second issue of learning their probability
distributions can be approached through maximum likelihood estimation (Lari and
Young, 1990; Goodman, 1998) or in a parametric Bayesian fashion by introducing
a Dirichlet prior over $\theta$ (Johnson et al., 2007b).
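The generative procedure of a PCFG and the resulting joint probability can be sketched in a few lines. The toy grammar below is entirely hypothetical (its symbols, rules and probabilities are invented for illustration), and the code is a minimal sketch of top-down sampling with fixed rule probabilities rather than any learning procedure.

```python
import math
import random

# Hypothetical toy grammar: each non-terminal maps to (right-hand side, probability)
# pairs, and the probabilities for each non-terminal sum to one.
RULES = {
    "Word": [(("Stem", "Suf"), 0.6), (("Stem",), 0.4)],
    "Stem": [(("k", "t", "b"), 0.5), (("d", "r", "s"), 0.5)],
    "Suf": [(("a",), 0.7), (("a", "t"), 0.3)],
}


def sample(symbol, rng, derivation):
    """Expand `symbol` top-down, recording every rule used in `derivation`."""
    if symbol not in RULES:  # terminal symbol: nothing left to expand
        return [symbol]
    options = RULES[symbol]
    rhs, prob = rng.choices(options, weights=[p for _, p in options])[0]
    derivation.append((symbol, rhs, prob))
    terminals = []
    for child in rhs:
        terminals.extend(sample(child, rng, derivation))
    return terminals


def joint_log_prob(derivation):
    """log P(w, R | theta): sum of log rule probabilities over the derivation."""
    return sum(math.log(prob) for _, _, prob in derivation)


derivation = []
word = "".join(sample("Word", random.Random(0), derivation))
print(word, math.exp(joint_log_prob(derivation)))
```

Each expansion is sampled independently given the grammar, which is exactly the independence assumption that the product form of the joint probability reflects.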


5.3.2 Context-free adaptor grammars 


Adaptor grammars (Johnson et al., 2007a) provide an elegant solution to both
learning problems mentioned above. The core idea of adaptor grammars is that whole
sub-trees are memorised and can be reused when generating strings from a
probabilistic grammar. This relaxes the independence assumptions governing non-terminal
expansion in the generative process of a PCFG to better match the pattern that
non-trivial constructions in language re-occur across different instances. For example,
the same stem would occur in many different surface word forms. Likewise, a given
discontiguous Hebrew root would occur across different stems, which is the kind
of regularity we intend to capture with our extension of adaptor grammars beyond
context-free rules (§5.6.2).

The tendency for a sub-tree fragment to be reused is governed by the choice of
adaptor function. We follow earlier applications (e.g. Johnson et al., 2007a; Huang
et al., 2011) and use the Pitman-Yor process (PYP) as adaptor function (Pitman,
1995; Pitman and Yor, 1997). As with the language models discussed in chapter 3,
the PYP’s extensibility via its base distribution and its power-law behaviour make it
suitable for modelling distributions over trees.
5.3.2.1 Formalism


A Pitman-Yor context-free adaptor grammar (PY-CFAG) is specified as a tuple
$\mathcal{G} = (G_{\text{PCFG}}, M, \mathbf{a}, \mathbf{b}, \boldsymbol{\alpha})$, where $G_{\text{PCFG}}$ is a PCFG as defined before and $M \subseteq N$ is a set
of adapted non-terminals. The vectors $\mathbf{a}$ and $\mathbf{b}$, indexed by the elements of $M$,
are the discount and concentration parameters for each adapted non-terminal, with
$a \in [0, 1]$ and $b > 0$. $\boldsymbol{\alpha}$ are parameters to Dirichlet priors on the rule probabilities $\theta$.

PY-CFAG defines a generative process over a set of trees $\mathcal{T}$. Unadapted
non-terminals $A' \in N \setminus M$ are expanded in the standard way of PCFGs, as described in
the previous section.

For each adapted non-terminal $A \in M$ there is a cache $C_A$ which stores
terminating tree fragments having $A$ as their root. We denote the tree fragment in $C_A$ that
corresponds to the $i$-th expansion of $A$ as $z_i$. A sequence of indices $\mathbf{z}_i = z_1, \ldots, z_i$
therefore assigns each individual expansion of $A$ to some tree fragment in the cache.

Given a cache $C_A$ that has $n$ previously generated trees comprising $m$ unique
trees each used $n_1, \ldots, n_m$ times (where $n = \sum_k n_k$), the tree fragment for the
next expansion of $A$, $z_{n+1}$, is sampled conditional on the previous assignments $\mathbf{z}_n$
according to

$$
z_{n+1} \mid \mathbf{z}_n \sim
\begin{cases}
\dfrac{n_k - a}{n + b} & \text{if } 1 \le z_{n+1} = k \le m \\[1ex]
\dfrac{am + b}{n + b} & \text{if } z_{n+1} = m + 1,
\end{cases}
\tag{5.1}
$$

where $a$ and $b$ are the elements of $\mathbf{a}$ and $\mathbf{b}$ corresponding to $A$.

1. The first case denotes the situation where a previously cached tree is reused
for this $(n+1)$-th expansion of $A$. This expands $A$ with a fully terminating tree
fragment in a single step, meaning that none of the nodes descending from $A$
in the tree being generated are subject to further expansion.

2. The second case by-passes the cache and expands $A$ according to the rules $P_A$
and rule probabilities $\theta_A$ of the base grammar $G_{\text{PCFG}}$. This samples a
sub-tree with root $A$ from the PYP base distribution, a process during which other
caches $C_B$ (s.t. $B \in M$, $B \neq A$) may come into play when expanding
descendants of $A$; thus a PY-CFAG can define a hierarchical stochastic process.

Both cases eventually result in a terminating tree-fragment for $A$, which is then
added to the cache, updating the counts $n$, $n_{z_{n+1}}$ and potentially $m$.

The adaptation does not affect the string language of the base grammar $G_{\text{PCFG}}$, but
it maps the distribution over trees to one that is distributed according to the PYP.
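The predictive rule in equation (5.1) is easy to compute for a single cache. The sketch below, with a hypothetical function name `pyp_predictive`, returns the probability of reusing each of the $m$ cached trees followed by the probability of by-passing the cache; it covers only this one sampling step and not the base-grammar expansion or the cache update.

```python
def pyp_predictive(counts, a, b):
    """Probabilities for z_{n+1} given usage counts n_1..n_m of the cached trees.

    Returns a list of length m + 1: entries 0..m-1 are the reuse probabilities
    (n_k - a) / (n + b); the last entry is the new-tree probability
    (a * m + b) / (n + b).
    """
    n, m = sum(counts), len(counts)
    reuse = [(n_k - a) / (n + b) for n_k in counts]
    new = (a * m + b) / (n + b)
    return reuse + [new]


# Illustrative values only: two cached trees used 3 and 1 times, a = 0.5, b = 1.
probs = pyp_predictive([3, 1], a=0.5, b=1.0)
print(probs, sum(probs))
```

The discount $a$ removes a fixed amount of mass from every cached tree and reassigns it to the new-tree case, so the probabilities always sum to one regardless of the counts.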

5.3.2.2 Recursion


Context-free grammars may, in general, include recursion in their rewrite rules,
either directly, e.g. $X \to X\,Y$, or indirectly, e.g. $X \to Y\,Z$ and $Z \to XQ$.

A PY-CFAG that features this type of recursion among its adapted non-terminals
$M$ complicates sampler-based inference, since it can involve repeated sampling
from the same CRP without being able to update its counts correctly. In theory,
a Metropolis-Hastings accept/reject step could be used as remedy but it is not
obvious what proposal distribution would lead to acceptable rejection rates. Thus we
follow Johnson et al. (2007a) and restrict the adaptor grammars we use to exclude
recursive rewrite rules among adapted non-terminals. (This is what the “$B \neq A$”
condition in step 2 of the previous section alludes to.)
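One way to check a candidate grammar against this restriction is a reachability test. The sketch below interprets the restriction as "no adapted non-terminal may reach itself through any chain of rewrite rules", which is an assumption about the intended reading; the grammar encoding and the function name `recursive_adapted` are likewise hypothetical.

```python
def recursive_adapted(rules, adapted):
    """Return the adapted non-terminals that can reach themselves via rewrite rules.

    `rules` maps each non-terminal to a list of right-hand sides (tuples of symbols);
    any symbol that is not a key of `rules` is treated as a terminal.
    """
    def reachable(start):
        seen, stack = set(), [start]
        while stack:
            for rhs in rules.get(stack.pop(), []):
                for sym in rhs:
                    if sym in rules and sym not in seen:
                        seen.add(sym)
                        stack.append(sym)
        return seen  # contains `start` only if some derivation leads back to it

    return {A for A in adapted if A in reachable(A)}


# Direct recursion in the style of the example in the text: X -> X Y.
direct = {"X": [("X", "Y")], "Y": [("y",)]}
print(recursive_adapted(direct, {"X", "Y"}))
```

An empty result indicates that the set of adapted non-terminals satisfies the restriction under this reading.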


5.4 An idealised grammar for learning morphology 


A unifying theme in this thesis is to show that specific though high-level intuitions
about morphology can be encoded in probabilistic models to improve their quality
and capabilities. The high-level intuition we want to employ here is that
morphological derivation and inflection are often non-concatenative.


Context-free adaptor grammars (§5.3.2) are effective at modelling concatenative
morphology (Johnson et al., 2007a; Johnson, 2008). They offer the convenience of
being able to capture the essence of phenomena ranging from single-slot inflection
to agglutinative derivation by writing down a few context-free rewrite rules:

$$\text{Word} \to (\text{Pre}^{*}\ \text{Stem}\ \text{Suf}^{*})^{+} \tag{5.2}$$

$$\text{Stem} \mid \text{Pre} \mid \text{Suf} \to \text{Morph} \tag{5.3}$$

$$\text{Morph} \to \langle\text{terminal strings}\rangle \tag{5.4}$$

Our objective is to be able to specify additional rewrite rules of the following
kind that would capture non-concatenative phenomena as exemplified:

$$\text{Stem} \to \textbf{intercal}(\text{Root}, \text{Template}) \tag{5.5}$$

e.g. Arabic derivation k-t-b + i-a ⇒ kitAb (‘book’)

$$\text{Stem} \to \textbf{infix}(\text{Stem}, \text{Infix}) \tag{5.6}$$

e.g. Tagalog sulat (‘write’) ⇒ sumulat (‘wrote’)

$$\text{Stem} \to \textbf{circfix}(\text{Stem}, \text{Circumfix}) \tag{5.7}$$

e.g. Indonesian percaya (‘to trust’) ⇒ kepercayaan (‘belief’)


The bold-faced “functions” combine the potentially discontiguous yields of the 
argument symbols into single contiguous strings, e.g. infix(s-ulat, um) produces 
stem sumulat. 
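As a purely string-level illustration of these three idealised functions, the sketch below represents a discontiguous yield as a tuple of contiguous pieces. This representation, the argument conventions and the transliterated vocalism pieces `("i", "A")` are assumptions made for the example; the sketch is not the formal rewrite-rule definition that the grammar formalism introduced later provides.

```python
from itertools import zip_longest


def intercal(root, template):
    """Interleave root pieces with template pieces, starting with the root."""
    return "".join(r + t for r, t in zip_longest(root, template, fillvalue=""))


def infix(stem, infix_):
    """Insert `infix_` into the gap of a two-piece discontiguous stem."""
    before, after = stem
    return before + infix_ + after


def circfix(stem, circumfix):
    """Wrap a contiguous stem in the two pieces of a circumfix."""
    prefix, suffix = circumfix
    return prefix + stem + suffix


print(intercal(("k", "t", "b"), ("i", "A")))   # Arabic stem from root k-t-b
print(infix(("s", "ulat"), "um"))              # Tagalog infixation
print(circfix("percaya", ("ke", "an")))        # Indonesian circumfixation
```

In each case the arguments may be discontiguous, while the result is a single contiguous string, which mirrors how the bold-faced functions are described above.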

The question is how to express the multi-argument functions appearing in the
idealisation above in terms of a formal rewrite grammar that can function as an
adaptor grammar. We propose the use of a mildly context-sensitive grammar
formalism (Joshi, 1985) that will allow for a concrete definition of intercal that is a
formal rewrite rule. The more powerful grammar formalism retains the modelling
convenience, and is notably consistent with the universal morphological rule
constraint proposed by McCarthy (1981, p. 405), whereby “morphological rules must
be context-sensitive rewrite rules affecting no more than one segment at a time, and
no richer type of rule is permitted in the morphology”. It will allow a discontiguous
string like k-t-b to be dominated by a single non-terminal node in a parse tree, which
is very useful for probabilistic modelling.
5.5 Simple range concatenating grammars


We apply simple Range Concatenating Grammars (SRCGs) (Boullier, 2000) to
parse contiguous and discontiguous morphemes from an input string. These
grammars are mildly context-sensitive, forming a superset of context-free grammars that
retains parsing time-complexity that is polynomial in the input length.⁸ An
SRCG-rule operates on vectors of ranges over the input, in contrast to the way a CFG-rule
operates on spans (single ranges). In this way a non-terminal symbol in an SRCG
(CFG) derivation can dominate a subset (substring) of terminals in an input string.


5.5.1 Formalism 

An SRCG $\mathcal{G}$ is a tuple $(N, T, V, P, S)$, with finite sets of non-terminals ($N$),
terminals ($T$) and variables ($V$), with a start symbol $S \in N$. A rewrite rule $p \in P$ of rank
$r = \rho(p) \ge 0$ has the form

$$A(\alpha_1, \ldots, \alpha_{\psi(A)}) \to B_1(\beta_{1,1}, \ldots, \beta_{1,\psi(B_1)}) \ldots B_r(\beta_{r,1}, \ldots, \beta_{r,\psi(B_r)}),$$

where each $\alpha \in (T \cup V)^{*}$, each $\beta \in V$, and $\psi(A)$ is the number of arguments a
non-terminal $A$ has, called its arity. By definition, the start symbol has arity 1. Any
variable $v \in V$ appearing in a given rule must be used exactly once on each side of
the rule. Terminating rules are written with $\varepsilon$ as the right-hand side and thus have
rank 0.


A range is a pair of integers $\langle i, j \rangle$ denoting the substring $w_{i+1} \ldots w_j$ of a string
$w = w_1 \ldots w_n$. A non-terminal becomes instantiated when its variables are bound
to ranges through substitution. Variables within an argument imply concatenation
and therefore have to bind to adjacent ranges.

An instantiated non-terminal $A'$ is said to derive $\varepsilon$ if the consecutive application
of a sequence of instantiated rules rewrites it as $\varepsilon$. A string $w$ is within the language
defined by a particular SRCG iff the start symbol $S$, instantiated with the exhaustive
range $\langle 0, n \rangle$, derives $\varepsilon$.

8 Our formulation is in terms of SRCGs, which are equivalent in power to linear context-free
rewrite systems (Vijay-Shanker et al., 1987) and multiple context-free grammars (Seki et al., 1991),
all of which are weaker than (non-simple) range concatenating grammars (Boullier, 2000). A good
overview of these formalisms and their parsing complexity is given by Kallmeyer (2010).

An important distinction with regard to CFGs is that, due to the instantiation
mechanism, the ordering of non-terminals on the right-hand side of an SRCG rule
is irrelevant, i.e. $A(ab) \to B(a)\,C(b)$ and $A(ab) \to C(b)\,B(a)$ are the same rule.⁹
Consequently, the isomorphisms of any given SRCG derivation tree all encode the
same string, which is uniquely defined through the instantiation process.


5.5.2 Application to root-templatic morphology 


A fragment of the idealised grammar schema from the previous section (§5.4) can
be rephrased as an SRCG by writing the rules in the newly introduced notation, and
supplying a definition of the intercal function as simply another rule of the grammar:

$$\text{Word}(abc) \to \text{Pre}(a)\ \text{Stem}(b)\ \text{Suf}(c)$$

$$\text{Stem}(abcde) \to \text{Root}(a, c, e)\ \text{Template}(b, d),$$

where $a, b, c, d, e \in V$. An instantiation of the latter rule with $w = \text{kitAb}$ is

$$\text{Stem}(\langle 0..1 \rangle, \langle 1..2 \rangle, \langle 2..3 \rangle, \langle 3..4 \rangle, \langle 4..5 \rangle) \to \text{Root}(\langle 0..1 \rangle, \langle 2..3 \rangle, \langle 4..5 \rangle)\ \text{Template}(\langle 1..2 \rangle, \langle 3..4 \rangle).$$

Given an appropriate set of grammar rules (as we present in §5.7), we can parse
an input string to obtain a tree as shown in Figure 5.1. The overlapping branches of
the tree demonstrate that this grammar captures something a CFG could not. From
the parse tree one can read off the word’s root morpheme and the template used.
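As a minimal sketch of the instantiation above, the following Python binds each of the five variables of the Stem rule to a single-character range and reads off the root and template characters. The helper names `substring` and `instantiate_stem` are hypothetical, and the restriction to unit-length ranges is a simplifying assumption of this sketch; in general a variable may bind to a longer range.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (i, j) covers w[i:j], i.e. the characters w_{i+1} ... w_j


def substring(word: str, r: Range) -> str:
    """Return the substring of `word` covered by the range r = (i, j)."""
    i, j = r
    return word[i:j]


def instantiate_stem(word: str) -> Tuple[List[Range], List[Range]]:
    """Instantiate Stem(abcde) -> Root(a, c, e) Template(b, d) on a
    5-character word, binding each variable to one single-character range.

    Adjacent variables inside the argument `abcde` must bind to adjacent
    ranges, so the five ranges tile the word from left to right.
    """
    if len(word) != 5:
        raise ValueError("this toy rule only covers 5-character stems")
    a, b, c, d, e = [(i, i + 1) for i in range(5)]
    root = [a, c, e]       # arguments passed to Root
    template = [b, d]      # arguments passed to Template
    return root, template


root_ranges, template_ranges = instantiate_stem("kitAb")
root = [substring("kitAb", r) for r in root_ranges]
template = [substring("kitAb", r) for r in template_ranges]
print(root, template)
```

The root ranges are non-adjacent, which is exactly what a single CFG non-terminal cannot dominate.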


9 Certain ordering restrictions over the variables within an argument need to hold for an SRCG to
indeed be a simple RCG (Boullier, 2000).

Word(wakitAbi)
    Pre(wa)
    Stem(kitAb)
        Root(k, t, b)
        Template(i, A)
    Suf(i)

Figure 5.1: Example derivation for wakitAbi (and my book) using the SRCG fragment
from §5.5.2. CFGs cannot capture such crossing branches.


5.5.3 Time complexity and tractability


Although SRCGs specify mildly context-sensitive grammars, each step in a derivation
is context-free: a node’s expansion does not depend on other parts of the tree.
This property implies that a recognition or parsing algorithm can have a worst-case
time complexity that is polynomial in the input length $n$. The asymptotic expression
is $O(n^{(\rho+1)\psi})$ for arity $\psi$ and rank $\rho$, which reduces to $O(n^{3\psi})$ for a binarised
grammar (Rodriguez and Satta, 2009; Kallmeyer, 2010, ch. 7); the formal basis for
this result traces back to analyses by Seki et al. (1991) and Boullier (2000), inter alia.

We briefly state the intuition. For a binarised grammar, the worst case occurs for
a rule $A(\alpha_1, \ldots, \alpha_\psi) \to B(\beta_1, \ldots, \beta_\psi)\ C(\gamma_1, \ldots, \gamma_\psi)$, with each $\alpha_i$ consisting of
two variables $u_i v_i$. In a bottom-up parser of SRCG (Kallmeyer, 2010), $\psi$ ranges,
hence $2\psi$ indices into the input string, have to be tracked for each right-hand side
item. The adjacency in the arguments of $A$ implies that the end point of the range
bound to $u_i$ is the same as the starting point of the range bound to $v_i$, so that the total
number of independent indices is $3\psi$, each ranging up to $n$.
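The index-counting argument can be checked with a small sketch. The function names are hypothetical, and the count assumes that in each argument $u_i v_i$ one variable is passed to each of the two right-hand-side items, which is how we read the intuition above.

```python
def independent_indices(arity: int) -> int:
    """Independent string indices for a binarised rule
    A(u1 v1, ..., uk vk) -> B(...) C(...), with k = arity.

    Each of the two right-hand-side items tracks `arity` ranges, i.e.
    2 * arity indices. Adjacency of u_i and v_i inside each argument of A
    identifies one pair of indices per argument, removing `arity` of them.
    """
    tracked = 2 * (2 * arity)
    shared = arity
    return tracked - shared


def worst_case_exponent(rank: int, arity: int) -> int:
    """Exponent e of the O(n^e) bound, e = (rank + 1) * arity."""
    return (rank + 1) * arity


# For a binarised grammar (rank 2) the two views agree: 3 * arity.
for k in range(1, 6):
    assert independent_indices(k) == worst_case_exponent(2, k)

print(worst_case_exponent(rank=2, arity=5))
```

With rank 2 and arity 5 the exponent is 15.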

To capture the maximal case of a stem comprising $k$ discontiguous templatic
character segments and $k - 1$ interspersed root characters would require a grammar
that has arity $\psi = k$. For Arabic, which has up to quadriliteral roots, hence $k \leq 5$,
the time complexity would be at most $O(n^{15})$. This is a daunting proposition for
parsing, but we are careful to set up our application of SRCGs in such a way that
this is not too big an obstacle:

Firstly, our grammars are defined over the characters that make up a word, and 
not over words that make up a sentence. As such, the input length n would tend to 
be shorter than when parsing full sentences from a corpus. 

Secondly, we do type-based morphological analysis, a view supported by evidence
from Goldwater et al. (2006), so each unique word in a dataset is only parsed
once with a given grammar. The set of word types attested in the data sources of
interest here is fairly limited, typically in the tens of thousands. For these reasons, our
parsing and inference tasks turn out to be tractable despite the high time-complexity.
As a side remark, note that by refactoring rules the arity and rank can be balanced
in a way that is optimal for parsing time-complexity, as characterised by Gildea
(2010). It is thus conceivable that the appropriate refactoring of our grammars could
bring down the complexity from what is given above.


5.6 Simple range concatenating adaptor grammars 

We have argued that generative grammars are convenient for expressing linguistic
intuitions of word formation (§5.4) and demonstrated how SRCGs can be used to
encode root-templatic morphology (§5.5.2). In this section, we tie together SRCGs
and adaptor grammars to obtain a framework capable of learning non-concatenative
morphology in an unsupervised fashion.


5.6.1 Probabilistic SRCG 


The probabilistic extension of SRCGs is similar to the probabilistic extension of
CFGs to PCFGs, and has been used in other guises (Kato et al., 2006; Maier, 2010).
Each rule $r \in P$ of the SRCG $\mathcal{G}_S$ (as defined in §5.5.1) has an associated probability
$\theta_r$ such that $\sum_{r \in P_A} \theta_r = 1$. A random string in the language of the grammar can
then be obtained through a generative procedure that begins with the start symbol $S$
and iteratively expands it until deriving $\varepsilon$: at each step, for some current symbol $A$,
a rewrite rule $r$ is sampled randomly from $P_A$ in accordance with the distribution
over rules and used to expand $A$. This procedure terminates when no further expansions
are possible. Of course, expansions need to respect the range concatenating
and ordering constraints imposed by the variables in rules. The expansions imply a
chain of variable bindings going down the tree, and instantiation happens only when
rewriting into $\varepsilon$s but then propagates back up the tree.

The probability $P(w, t)$ of the resulting tree $t$ and terminal string $w$ is the product
$\prod_r \theta_r$ over the sequence of rewrite rules used.
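The product over rule probabilities can be sketched as follows. The rule strings and probability values in `THETA` are illustrative assumptions, not parameters from our models, and only two rules of a derivation are scored; a full derivation would also include the rules that rewrite Pre, Suf, Root, Template and Char.

```python
import math
from typing import Dict, List

# Toy rule probabilities theta_r, grouped by left-hand-side non-terminal A.
# Names and values are hypothetical and serve only to illustrate the product.
THETA: Dict[str, Dict[str, float]] = {
    "Word": {
        "Word(abc) -> Pre(a) Stem(b) Suf(c)": 0.4,
        "Word(a) -> Stem(a)": 0.6,
    },
    "Stem": {
        "Stem(abcde) -> Root(a,c,e) Template(b,d)": 0.7,
        "Stem(abc) -> Root(a,b,c)": 0.3,
    },
}


def check_normalised(theta: Dict[str, Dict[str, float]]) -> None:
    """Verify that the rule probabilities of every non-terminal sum to one."""
    for lhs, rules in theta.items():
        total = sum(rules.values())
        if not math.isclose(total, 1.0):
            raise ValueError(f"rules for {lhs} sum to {total}, expected 1")


def derivation_log_prob(rules_used: List[str],
                        theta: Dict[str, Dict[str, float]]) -> float:
    """Log of the product of theta_r over the rewrite rules used."""
    flat = {rule: p for rules in theta.values() for rule, p in rules.items()}
    return sum(math.log(flat[rule]) for rule in rules_used)


check_normalised(THETA)
derivation = [
    "Word(abc) -> Pre(a) Stem(b) Suf(c)",
    "Stem(abcde) -> Root(a,c,e) Template(b,d)",
]
log_p = derivation_log_prob(derivation, THETA)
print(math.exp(log_p))
```

Working in log space avoids underflow once derivations involve many rules.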

The invariance of SRCG trees under isomorphism does not make the probabilistic
model deficient. It adds a source of spurious ambiguity that we pre-empt
in practice by requiring that grammar rules are specified in a canonical way which
ensures a one-to-one correspondence between the order of nodes in a tree and of
terminals in the yield.


5.6.2 PYSRCAG 


A Pitman-Yor simple Range Concatenating Adaptor Grammar (PYSRCAG) is specified
as a tuple $\mathcal{G} = (\mathcal{G}_S, M, \mathbf{a}, \mathbf{b}, \boldsymbol{\alpha})$, where $\mathcal{G}_S$ is a probabilistic SRCG, $M$ is the
set of adapted non-terminals and the hyperparameters are analogous to those of the
PY-CFAG (§5.3).

The inference procedure under our model is very similar to that of PY-CFAGs,
so we restate the central aspects here but refer the reader to the original article by
Johnson et al. (2007a) for further details. First, one may integrate out the adaptors
to obtain a single distribution over the set of trees generated from a particular
non-terminal. Thus, the joint probability of a particular sequence $\mathbf{z}$ for the adapted
non-terminal $A$ with cached counts $(n_1, \ldots, n_m)$ is

$$PY(\mathbf{z} \mid a, b) = \frac{\prod_{k=1}^{m} \left( \big(a(k-1) + b\big) \prod_{j=1}^{n_k - 1} (j - a) \right)}{\prod_{i=0}^{n-1} (i + b)}, \qquad (5.8)$$

where $n = \sum_k n_k$. Taking all the adapted non-terminals into account, the joint
probability of a set of full trees $T$ under the grammar $\mathcal{G}$ is

$$P(T \mid \mathbf{a}, \mathbf{b}, \boldsymbol{\alpha}) = \prod_{A \in M} \frac{B(\boldsymbol{\alpha}_A + \mathbf{f}_A)}{B(\boldsymbol{\alpha}_A)}\, PY(\mathbf{z}(T) \mid \mathbf{a}, \mathbf{b}), \qquad (5.9)$$

where $\mathbf{f}_A$ is a vector of the usage counts of rules $r \in P_A$ across $T$, and $B$ is the Euler
beta function.
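The Pitman-Yor joint probability of a sequence with cached counts $(n_1, \ldots, n_m)$ can be evaluated directly in log space. The sketch below uses the standard Pitman-Yor process form with discount $a$ and concentration $b$; the function name `py_log_joint` is hypothetical, and the restriction to $b > 0$ is a simplification made here so that every logarithm is defined.

```python
import math
from typing import Sequence


def py_log_joint(counts: Sequence[int], a: float, b: float) -> float:
    """Log joint probability of a Pitman-Yor seating arrangement.

    counts -- cached counts (n_1, ..., n_m), one per cached tree
    a      -- discount, 0 <= a < 1
    b      -- concentration, restricted to b > 0 in this sketch
    """
    if not 0.0 <= a < 1.0:
        raise ValueError("discount a must satisfy 0 <= a < 1")
    if b <= 0.0:
        raise ValueError("this sketch assumes concentration b > 0")
    if any(n_k < 1 for n_k in counts):
        raise ValueError("every cached count must be at least 1")

    log_num = 0.0
    for k, n_k in enumerate(counts, start=1):
        log_num += math.log(a * (k - 1) + b)   # opening the k-th cache entry
        for j in range(1, n_k):
            log_num += math.log(j - a)         # reusing an existing entry
    n = sum(counts)
    log_den = sum(math.log(i + b) for i in range(n))
    return log_num - log_den


print(math.exp(py_log_joint([1, 1], a=0.5, b=1.0)))
```

For two items that each open a new cache entry, the second item contributes $(a + b)/(1 + b)$, which is the value the call above evaluates.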

The posterior distribution over a set of strings $\mathbf{w}$ is obtained by marginalising
(5.9) over all trees that have $\mathbf{w}$ as their yields. This is intractable to compute
directly, so instead we use MCMC techniques to obtain samples from that posterior
using a component-wise Metropolis-Hastings sampler.¹⁰ The sampler works by
visiting each string $w$ in turn and drawing a new tree for it under a proposal grammar $\mathcal{G}_Q$
and randomly accepting that as the new analysis for $w$ according to the Metropolis-Hastings
accept-reject probability. As proposal grammar, we use the analogous
approximation of our $\mathcal{G}$ as Johnson et al. used for PCFGs, namely by taking a static
snapshot $\mathcal{G}_Q$ of the adaptor grammar where additional rules rewrite adapted
non-terminals as the terminal strings of their cached trees. Drawing a sample from the
proposal distribution is then a matter of drawing a random tree from the parse chart
of $w$ under $\mathcal{G}_Q$. The main distinction of the PYSRCAG therefore lies in the use of
an SRCG parser to produce the chart.¹¹
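The accept-reject step follows the generic Metropolis-Hastings rule for an independence proposal. The sketch below shows only that rule, under the assumption that log-probabilities of the current and proposed trees are available under both the model and the proposal grammar; the function names are hypothetical and no parsing is performed.

```python
import math
import random
from typing import Callable, TypeVar

Tree = TypeVar("Tree")


def mh_accept_prob(log_p_new: float, log_p_old: float,
                   log_q_new: float, log_q_old: float) -> float:
    """Acceptance probability min(1, [p(new) q(old)] / [p(old) q(new)]),
    where p is the target (adaptor grammar) and q the proposal grammar."""
    log_ratio = (log_p_new - log_p_old) + (log_q_old - log_q_new)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


def mh_step(current: Tree, proposal: Tree,
            log_p: Callable[[Tree], float],
            log_q: Callable[[Tree], float],
            rng: random.Random) -> Tree:
    """Return the proposed tree if accepted, otherwise keep the current one."""
    accept = mh_accept_prob(log_p(proposal), log_p(current),
                            log_q(proposal), log_q(current))
    return proposal if rng.random() < accept else current


print(mh_accept_prob(math.log(0.2), math.log(0.4), 0.0, 0.0))
```

The closer the snapshot proposal is to the true model, the closer the ratio is to one and the more proposals are accepted.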


5.7 PYSRCAG model for root-templatic morphology 


The previous sections have laid the groundwork for learning adaptor grammars that
can cover discontiguous strings. This section provides the detail for how we apply
PYSRCAGs to model Semitic morphology, covering both its concatenative and
root-templatic aspects.


5.7.1 Words as concatenated morphemes 

We start with a CFG-based adaptor grammar that models words as consisting of a
stem and any number of prefixes and suffixes:¹²

$$\text{Word} \to \text{Pre}^{*}\ \text{Stem}\ \text{Suf}^{*} \qquad (5.10)$$

$$\text{Pre} \mid \text{Stem} \mid \text{Suf} \to \text{Char}^{+} \qquad (5.11)$$

This fragment can be seen as building on the stem-and-affix adaptor grammar
presented in Johnson et al. (2007a) for morphological analysis of English, of which a
later version also covers multiple affixes (Sirts and Goldwater, 2013). In the particular
case of Arabic, multiple affixes are required to handle the attachment of particles
and clitics onto base words.

10 An alternative to MCMC is to do variational inference on top of the stick-breaking representation
of the PYP (Cohen et al., 2010).

11 We acknowledge the use of Mark Johnson’s publicly available implementation of PY-CFAGs,
which we extended as necessary for SRCGs.

12 Adapted non-terminals are indicated by underlining and we use the following abbreviations:
$X \to Y^{+}$ means one or more instances of $Y$ and encodes the rules $X \to Ys$ and $Ys \to Ys\ Y \mid Y$.
Similarly, $X \to Y^{*}\ Z$ allows zero or more instances of $Y$ and encodes the rules $X \to Z$ and
$X \to Y^{+}\ Z$. Further relabelling is added as necessary to avoid recursion among adapted
non-terminals.
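The repetition abbreviations used in the rules above can be expanded mechanically into plain rewrite rules. The following sketch is one way to do so; the function names and the tuple encoding of rules are hypothetical, and the helper symbol is formed by appending "s" to the repeated symbol.

```python
from typing import List, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (left-hand side, right-hand-side symbols)


def expand_plus(lhs: str, y: str) -> List[Rule]:
    """Encode X -> Y+ as X -> Ys and Ys -> Ys Y | Y."""
    ys = y + "s"
    return [(lhs, (ys,)), (ys, (ys, y)), (ys, (y,))]


def expand_star_then(lhs: str, y: str, z: str) -> List[Rule]:
    """Encode X -> Y* Z as X -> Z and X -> Y+ Z, the latter via the helper Ys."""
    ys = y + "s"
    return [(lhs, (z,)), (lhs, (ys, z)), (ys, (ys, y)), (ys, (y,))]


for rule_lhs, rule_rhs in expand_plus("Stem", "Char"):
    print(rule_lhs, "->", " ".join(rule_rhs))
```

The left-recursive helper rule generates any positive number of repetitions while keeping every rule at rank two or less.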


5.7.2 Complex stems 

In the preceding grammar, Stem is purely a sequence of characters without further
cached sub-structure. Although there are certain word forms for which this is an
appropriate view, this rule is obviously blind to the reusable sub-structures inherent
in root morphemes and templates.

Here, we make the logical extension of Stem to complex stems. To establish
the necessary conventions, we start with a rule-set that is appropriate for modelling
triliteral roots:

$$\text{Stem}(abcdefg) \to \text{R3}(b, d, f)\ \text{T4}(a, c, e, g) \qquad (5.12)$$

$$\text{Stem}(abcdef) \to \text{R3}(a, c, e)\ \text{T3}(b, d, f) \qquad (5.13)$$

$$\text{Stem}(abcde) \to \text{R3}(a, c, e)\ \text{T2}(b, d) \qquad (5.14)$$

$$\text{Stem}(abcd) \to \text{R3}(a, c, d)\ \text{T1}(b) \qquad (5.15)$$

$$\text{Stem}(abc) \to \text{R3}(a, b, c) \qquad (5.16)$$

Some of these cases call for additional rules that permute the variables. For example,
for handling other possible ways of forming a stem from a triliteral root and a single
templatic character, rule (5.15) also has a variant $\text{Stem}(abcd) \to \text{R3}(a, b, d)\ \text{T1}(c)$.
In these and other rule-sets given subsequently, such permuted variants are
suppressed but implied.


The justification for rule (5.16), which may look trivial, is that it is common
in unvocalised text for the stem to be nothing more than the concatenation of the
root characters. The inclusion of that rule is thus important for sharing statistical
strength of a given root across all stems it may occur in, including ones that feature
no templatic characters.

Note that although we provide the model with two sets of discontiguous
non-terminals, R and T, we do not pre-specify their mapping onto terminal strings. No
subdivision of the alphabet into vowels and consonants is hard-wired.

To complete the ruleset, a discontiguous non-terminal An is rewritten through
recursion on its arity:¹³

$$\text{A}n(v_1, \ldots, v_n) \to \text{A}l(v_1, \ldots, v_{n-1})\ \text{Char}(v_n) \qquad \text{(recursive case; } 1 \leq l = n - 1\text{)}$$

$$\text{A1}(v) \to \text{Char}(v) \qquad \text{(base case)}$$

where $v_i \in V$ are variables and Char rewrites all terminals $t \in T$ as $\text{Char}(t) \to \varepsilon$.
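The recursion on arity can be unrolled into an explicit list of rules for a given symbol. The following sketch prints such rules as strings; the function name `arity_rules` and the textual rule format are hypothetical conveniences, not the input format of any particular parser.

```python
from typing import List


def arity_rules(symbol: str, max_arity: int) -> List[str]:
    """Generate the rules rewriting symbol1 ... symbol<max_arity>, where each
    An peels off its last argument as a Char and recurses on A(n-1)."""
    if max_arity < 1:
        raise ValueError("max_arity must be at least 1")
    rules = [f"{symbol}1(v1) -> Char(v1)"]  # base case
    for n in range(2, max_arity + 1):
        lhs_args = ", ".join(f"v{i}" for i in range(1, n + 1))
        rec_args = ", ".join(f"v{i}" for i in range(1, n))
        rules.append(
            f"{symbol}{n}({lhs_args}) -> {symbol}{n - 1}({rec_args}) Char(v{n})"
        )
    return rules


for rule in arity_rules("R", 3):
    print(rule)
```

Because the arity is part of each symbol name, no symbol can rewrite to itself.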

5.8 Experiments 

We evaluate our model on Modern Standard Arabic, Quranic Arabic and Hebrew.
These languages are closely related in their morphology, and feature lexical
cognates. But they are sufficiently different that the transferral of rule-based
morphological analysers from one to the other typically requires manual intervention. A
key question in this evaluation is therefore whether an appropriate instantiation of
our adaptor grammars successfully generalises across these related languages.

We evaluate on three tasks: segmentation of a word into a linear sequence of 
morphemes, identification of a word’s root, and the acquisition of lexica of stems, 
affixes and roots. 

5.8.1 Evaluation data 

Our models are unsupervised and therefore learn from raw text, but their evaluation
requires annotated data as a gold standard, and these were derived as follows:

13 Including the arity as part of the non-terminal symbol names forms part of our convention here
to ensure that the grammar contains no cycles, a situation which would complicate inference under
PYSRCAG.

Bw

Bw' 

Root 

wlAt/./j/.- IHmsrvf 

walAta/./,./; foHanw™ 

f-H-m 

EArDy.sT/v/ 

EAriDiyy S7M 

Er-D 

w /w; jbA.sT v/ hm S (/r 

wa PRE jabbA S7M hurn w/ 7 

jbn 

fllpR£ mkAtb,s7M 

fali 1 PRE mukAtib STV / 

ktb 


Table 5.1: Example of words in our generated dataset Bw, shown in Buckwalter
transliteration.

5.8.1.1 Arabic (MSA) 


We created the dataset Bw by synthesising 50k morphotactically correct word types
from the morpheme lexicons and consistency rules supplied with the Buckwalter
Arabic Morphological Analyser (BAMA; Buckwalter, 2002).¹⁴ This allowed control
over the word shapes, which is important to focus the evaluation, while yielding
reliable segmentation and root annotations. Bw has no vocalisation; we denote the
corresponding vocalised dataset as Bw'. Example words are presented in Table 5.1.


5.8.1.2 Quranic Arabic 


We extracted the roughly 18k word types from a morphologically analysed version
of the Quran (Dukes and Habash, 2010). As an additional challenge, we left all
given diacritics intact for this dataset, Qu'.

5.8.1.3 Hebrew 


We leveraged the Hebrew CHILDES database as an annotated resource (Albert et al.,
2013), specifically using both the adult and child utterances from the Berman and
Ravid longitudinal corpora. 5k word types featuring at least one affix each were
extracted to form our dataset Heb. For words marked as non-standard child language,
we used the corrected forms provided by the database. Stressed and unstressed vowels
were conflated to overcome some inconsistencies in the source data.

Dataset    Words    Stems    Roots    m/w    c/w
Bw         48428    24197     4717    2.3     6.4
Bw'        48428    30891     4707    2.3    10.7
Qu'        18808    12021     1270    1.9     9.9
Heb         5231     3164      492    2.1     6.7

Table 5.2: Corpus statistics in number of types, including average number of morphemes
(m/w) and characters (c/w) per word type. The Roots column gives the number of distinct
surface-realised roots of length 3 or 4.

14 We used BAMA version 2.0, LDC2004L02, and sampled word types having a single stem and at
most one prefix, suffix or both, according to the following random procedure: Sample a shape
according to (stem: 0.1, pre+stem: 0.25, stem+suf: 0.25, pre+stem+suf: 0.4). Sample uniformly at random
(with replacement) a stem from the BAMA stem lexicon, and affix(es) from the ones consistent with
the chosen stem. The BAMA lexicons contain elementary and compound affixes, so some of the
generated words would permit a linguistic segmentation into multiple prefixes/suffixes. Nonetheless,
we take as gold-standard segmentation precisely the items used by our generation procedure.
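The random procedure used to synthesise Bw (footnote 14) can be sketched as below. The shape probabilities are those stated in the footnote; everything else is an assumption made for illustration: `TOY_LEXICON` is a hypothetical stand-in for the BAMA lexicons and compatibility rules, and its entries are not actual BAMA items.

```python
import random
from typing import Dict, List, Optional, Tuple

SHAPES = ["stem", "pre+stem", "stem+suf", "pre+stem+suf"]
SHAPE_PROBS = [0.1, 0.25, 0.25, 0.4]

# Hypothetical toy lexicon: stem -> (compatible prefixes, compatible suffixes).
TOY_LEXICON: Dict[str, Tuple[List[str], List[str]]] = {
    "ktb": (["w", "f"], ["hm", "y"]),
    "mkAtb": (["fl", "w"], ["y"]),
}

Word = Tuple[Optional[str], str, Optional[str]]


def sample_word(rng: random.Random,
                lexicon: Dict[str, Tuple[List[str], List[str]]] = TOY_LEXICON
                ) -> Word:
    """Sample a shape, then a stem uniformly, then affixes compatible with it."""
    shape = rng.choices(SHAPES, weights=SHAPE_PROBS, k=1)[0]
    stem = rng.choice(sorted(lexicon))
    prefixes, suffixes = lexicon[stem]
    parts = shape.split("+")
    pre = rng.choice(prefixes) if "pre" in parts else None
    suf = rng.choice(suffixes) if "suf" in parts else None
    return pre, stem, suf


def gold_segmentation(word: Word) -> List[str]:
    """The gold segmentation is exactly the items used to generate the word."""
    return [m for m in word if m is not None]


rng = random.Random(0)
word_types = {sample_word(rng) for _ in range(200)}  # set keeps distinct types
print(len(word_types))
```

Collecting samples into a set mirrors the fact that the dataset consists of word types rather than tokens.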


5.8.2 Model instantiations 


As baseline adaptor grammars, we use the context-free models that express
morphemes as sequences of characters. We consider two variants that match the known
characteristics of our datasets: Concat allows only up to one of each morpheme
type in a word, while MConcat allows multiple prefixes and suffixes.

Using these concatenative grammars as starting points, we formulate more
expressive grammars by adding rules that feature discontinuity. The canonical complex
stem grammars aimed at triliteral roots are denoted as Tpl and MTpl, depending on
the starting point grammar.

Informal inspection of the data revealed that cases exist where multiple characters
intervene between root characters. We thus also experiment with a variant that
allows the non-terminal T1 to be rewritten as up to three Char symbols. This model
is denoted Tpl3Ch.

For efficiency reasons, the aforementioned models are limited to rules of arity
at most 3. The grammars Tpl+T4 and TplR4 relax this constraint to include
non-terminal categories T4 and R4 & T4, respectively, in order to model quadriliteral
roots as well.

These model definitions are given in more detail in Table 5.3.

5.8.2.1 Sampling details

Our PYSRCAGs have various hyperparameters. The PYP hyperparameters $a$ and $b$
are modelled by placing flat Beta(1, 1) and vague Gamma(10, 0.1) priors on them,
respectively. Their values are then inferred using slice sampling (Neal, 2003; Johnson
and Goldwater, 2009). The hyperparameters $\boldsymbol{\alpha}$ are set to give symmetric Dirichlet
prior distributions over the rule probabilities $\theta$.

We start the sampler using batch initialisation, meaning the initial parse tree of
each word is generated purely from the base distribution grammar and is not
influenced by the PYP priors. Johnson and Goldwater (2009) found this approach to
lead to higher posterior probabilities than initialising incrementally according to the
model itself. Another option explored by that work is to include a type-based
element into the sampler by occasionally resampling the cached trees themselves. This
can improve mixing by updating the analyses of multiple words in a single step.
Experimentation with these variations in initialisation and sampling was not central to
our aim of applying SRCG-based adaptor grammars to root-templatic morphology,
but it is conceivable that they may enable small improvements over our results for
the same grammars we propose.

For each adaptor grammar, we collected 100 posterior samples after allowing
900 iterations of burn-in of the MCMC sampler described in §5.6.2. Morphological
analyses were obtained from the collected samples in various ways, as set out per
task in the following sections. To facilitate exposition, let $S$ denote the $J = 100$
collected samples, where the $j$th sample $S^{(j)} = \{t_i^{(j)}\}$ consists of a parse tree $t_i^{(j)}$ for
each word type $w_i$ in the dataset, $1 \leq i \leq N$ and $1 \leq j \leq J$.


5.8.3 Task 1: Segmentation 

5.8.3.1 Method 

The segmentation of a word under our adaptor grammars is determined unambiguously
by traversing the parse tree assigned to it. We split a word into morphemes
along the boundaries of the substrings dominated by the non-terminals Stem, Pre,
and Suf.

Concat  = { Word → Pre Stem Suf | Pre Stem | Stem Suf | Stem
            Pre | Stem | Suf → Char+ }

MConcat = Concat ∪ { Word → Pres Stem Sufs | Pres Stem | Stem Sufs
                     Pres → Pre+
                     Sufs → Suf+ }

Tpl     = { Stem(abcdef) → R3(a, c, e) T3(b, d, f)
            Stem(abcde) → R3(a, c, e) T2(b, d)
            Stem(abcd) → R3(a, c, d) T1(b)
            Stem(abc) → R3(a, b, c)
            Rn, Tn → (recursion described in §5.7.2) }

Tpl+T4  = Tpl ∪ { Stem(abcdefg) → R3(b, d, f) T4(a, c, e, g) }

Tpl3Ch  = Tpl ∪ { T1 → Char Char | Char Char Char }

TplR4   = Tpl+T4 ∪ { Stem(abcdefgh) → R4(b, d, f, h) T4(a, c, e, g) }

MTpl    = MConcat ∪ Tpl

Table 5.3: Summary of model identifiers in terms of canonical grammar rules. Certain rule
shapes involve a permutation in variables, but these are not shown explicitly.

We use the maximum a posteriori (MAP) parse tree $t(w)$ of a word $w$, which is
simply the tree that is assigned to that word most frequently in the collected posterior
samples.¹⁵ The joint probability of a word and tree can be factored as

$$P(w, t \mid \mathbf{a}, \mathbf{b}, \mathbf{w}) = P(w \mid t)\, P(t \mid \mathbf{a}, \mathbf{b}, \mathbf{w}), \qquad (5.17)$$

where the first term is 1 if $t$ yields $w$, and 0 otherwise; $\mathbf{a}$ and $\mathbf{b}$ are the
hyperparameters, and $\mathbf{w}$ the observed data. The second term is the posterior distribution
over trees (cf. Equation 5.9) and we can obtain an estimate of the probability of a
tree under this distribution by aggregating over the collected samples $S_i^{(1)}, \ldots, S_i^{(J)}$,
where $i$ is such that $w_i = w$. That is,

$$P(t \mid \mathbf{a}, \mathbf{b}, \mathbf{w}) \approx \frac{1}{J} \sum_{j=1}^{J} \delta_t\big(t_i^{(j)}\big), \qquad (5.18)$$

where $\delta$ is the Kronecker delta. Thus for a fixed word $w$, the MAP parse tree $t(w)$ is

$$t(w) = \arg\max_t P(w, t \mid \mathbf{a}, \mathbf{b}, \mathbf{w}) \qquad (5.19)$$

$$\phantom{t(w)} = \arg\max_t P(t \mid \mathbf{a}, \mathbf{b}, \mathbf{w}) \quad \text{(fixed } w\text{)} \qquad (5.20)$$

$$\phantom{t(w)} = \arg\max_t \sum_{j=1}^{J} \delta_t\big(t_i^{(j)}\big). \qquad (5.21)$$

15 An alternative method for obtaining a segmentation from the collected posterior samples is to
use maximum marginal decoding, which marginalises out sub-structures which are irrelevant to
segmentation (Johnson and Goldwater, 2009). In our experiments, this approach yielded very similar
segmentation results as the MAP-based method, and we thus focus on the latter.
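Selecting the MAP tree by Equation 5.21 amounts to counting how often each tree was assigned to a word across the collected samples and taking the most frequent one. A minimal sketch follows, assuming trees are represented as hashable values such as bracketed strings; the function name `map_trees` and the toy samples are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def map_trees(samples: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """Return, for each word type i, the tree assigned to it most often.

    samples[j][i] is the tree of word type i in posterior sample j.
    Ties are broken in favour of the tree encountered first.
    """
    if not samples:
        raise ValueError("at least one posterior sample is required")
    n_words = len(samples[0])
    result = []
    for i in range(n_words):
        counts = Counter(sample[i] for sample in samples)
        tree, _count = counts.most_common(1)[0]
        result.append(tree)
    return result


toy_samples = [
    ["(Stem kitAb)", "(Pre wa)(Stem kitAb)"],
    ["(Pre k)(Stem itAb)", "(Pre wa)(Stem kitAb)"],
    ["(Stem kitAb)", "(Stem wakitAb)"],
]
print(map_trees(toy_samples))
```

Dividing each count by the number of samples would give the probability estimate of Equation 5.18, but the normalisation does not affect the arg max.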


Segmentation quality is measured using the segment border F1-score (SBF; Sirts
and Goldwater, 2013). This is a micro-averaged metric comparing the word-internal
segmentation points of the predicted segmentation against the gold standard
segmentation.¹⁶
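A micro-averaged border F1-score of this kind can be sketched as follows. This is an illustrative reimplementation under our reading of the metric, namely precision and recall over word-internal boundary positions pooled across all words; it is not the evaluation script used for the reported results, and its conventions for edge cases may differ.

```python
from typing import List, Sequence, Set


def borders(morphemes: Sequence[str]) -> Set[int]:
    """Word-internal boundary positions implied by a segmentation."""
    positions: Set[int] = set()
    offset = 0
    for morpheme in morphemes[:-1]:
        offset += len(morpheme)
        positions.add(offset)
    return positions


def segment_border_f1(predicted: Sequence[Sequence[str]],
                      gold: Sequence[Sequence[str]]) -> float:
    """Micro-averaged F1 over word-internal segmentation points."""
    true_pos = n_pred = n_gold = 0
    for pred_seg, gold_seg in zip(predicted, gold):
        if "".join(pred_seg) != "".join(gold_seg):
            raise ValueError("segmentations must spell the same word")
        pred_b, gold_b = borders(pred_seg), borders(gold_seg)
        true_pos += len(pred_b & gold_b)
        n_pred += len(pred_b)
        n_gold += len(gold_b)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


predicted = [["wa", "kitAb", "i"], ["k", "itAb"]]
gold = [["wa", "kitAb", "i"], ["kitAb"]]
print(segment_border_f1(predicted, gold))
```

Pooling the counts before computing precision and recall is what makes the score micro-averaged: words with many borders weigh more than words with few.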

As an external baseline model we used Morfessor (Creutz and Lagus, 2007), which
performs decently in unsupervised morphological segmentation of a variety of
languages, but only handles concatenation.


5.8.3.2 Results 


The introduction of complex-stem forming rules consistently brings large gains in
segmentation quality compared to the purely concatenative baseline adaptor
grammars (Table 5.4).

Across our data sets, the simplest templatic grammar featuring triliteral roots
obtains an average relative increase in SBF of 21% (average SBF 47.8→58.0). These
gains are not distributed uniformly across data sets and range from 12% to 29%.
But regarding the question of whether a particular templatic adaptor grammar can
deliver benefits across both Arabic and Hebrew, we conclude that the answer is in
the positive for the segmentation task.
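The summary figures quoted above can be recomputed from the per-dataset scores in Table 5.4. The sketch below assumes that the baseline and simplest templatic grammars for Qu' are MConcat and MTpl, respectively, as these are the only ones reported for that data set. Whether the 21% figure is the mean of the per-dataset gains or the gain between the two averages is not stated; both round to 21%.

```python
from statistics import mean

# SBF scores from Table 5.4 (Concat/Tpl; MConcat/MTpl for Qu').
concat = {"Bw'": 47.4, "Bw": 64.2, "Heb": 60.1, "Qu'": 19.6}
tpl = {"Bw'": 60.4, "Bw": 71.9, "Heb": 77.3, "Qu'": 22.5}

# Relative increase in percent, per data set.
relative_gain = {d: 100.0 * (tpl[d] / concat[d] - 1.0) for d in concat}

mean_concat = mean(concat.values())
mean_tpl = mean(tpl.values())
average_gain = mean(relative_gain.values())
gain_of_averages = 100.0 * (mean_tpl / mean_concat - 1.0)

for dataset, gain in relative_gain.items():
    print(f"{dataset}: {gain:.1f}%")
print(f"average SBF {mean_concat:.1f} -> {mean_tpl:.1f}")
```

The smallest gain is on the unvocalised Bw and the largest on Heb.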


16 We acknowledge the use of the evaluation script by Sharon Goldwater, obtainable from
http://homepages.inf.ed.ac.uk/sgwater/resources.html

             Bw'     Bw      Heb                  Qu'
Morfessor    55.6    40.0    24.2    Morfessor    44.3
Concat       47.4    64.2    60.1    MConcat      19.6
Tpl          60.4    71.9    77.3    MTpl         22.5
Tpl3Ch       60.5    72.2    77.4    MTpl3Ch      25.7
Tpl+T4       64.5    71.6    77.1    MTpl+T4      24.8
TplR4        74.5    73.7    78.6

Table 5.4: Segmentation quality in segment-border F-score (SBF).


Variation in SBF among the different variants of the triliteral grammars is small
and mixed across data sets.

The vocalised Arabic data set Bw' is most sensitive to changes in grammar
expressivity. Introduction of the T4 non-terminal to the basic triliteral grammar
increases SBF on Bw' by 4 points, on top of which the most expressive grammar,
TplR4, adds another 10 points. This amounts to a substantial total improvement
of 57% by TplR4 over the baseline Concat.

On the unvocalised Bw, grammar expressivity beyond Tpl brings smaller gains.
This is consistent with the fact that Bw represents an easier segmentation task, since
its unvocalised words are by definition shorter, on average, than those in Bw', and
cover fewer distinct contiguous morphemes.

When repeating the experiment for a version of the Hebrew training data that
includes affixless words, SBF drops to the fifties. The overall trend that the templatic
grammars improve segmentation remains, and the relative improvement is slightly
larger in this setting.

A salient aspect of these results, when considered in conjunction with the root
identification results to be presented in §5.8.4.3, is that templatic rules aid
segmentation quality independent of root identification accuracy. In other words, our attempt
at capturing discontiguous roots and templates benefits segmentation quality by
allowing the model to capture discontiguous regularities at the sub-stem level, which
is advantageous regardless of whether the captured elements actually correspond to
the linguistically motivated analyses.

Figure 5.2: Distribution of number of morphemes per word for data set Qu', showing the
large discrepancy between the gold standard and the induced segmentations.

5.8.3.3 Further analysis of Quran segmentation 

The trend that templatic rules improved segmentation quality holds for the Quran
data set (improvements of 15% to 31% relative to the baseline adaptor grammar),
but absolute SBF for the adaptor grammars is disappointingly low for this data set.

Our analysis shows that all the evaluated adaptor grammars over-segmented
heavily. As an illustration, the gold standard has on average 1.9 morphemes per
word (m/w, Table 5.2), while the models hypothesise 2.8-3.2. The distribution
of m/w is very different. The gold standard is dominated by words having one
or two morphemes, giving a skewed distribution of m/w, while the adaptor
grammars induce a more symmetric distribution where words with three morphemes
dominate (Figure 5.2). If we evaluate only on the subset of words that have 3 or more
morphemes according to the gold standard, segmentation quality is higher (MConcat: 32.7,
MTpl3Ch: 38.6).

We investigated two hypotheses that could explain the low SBF scores on this
data set. The first hypothesis is that the outcome is an artefact of our methodology
for extracting segmentations from parse trees. As detailed in §5.8.3.1, we split
words maximally along the elementary affix categories, Pre and Suf, even though
the grammars used for Qu' allow composite affixes. In Table 5.5 we designate this
as the fine-grained grammar evaluated at the fine-grained level. An alternative is to
split at the boundaries of the Stem category only and to ignore the internal structure
of composite affixes. That is, evaluate the fine-grained grammar in a coarse-grained
fashion. Finally, we can train a coarse-grained grammar (Concat) instead of a
fine-grained one (MConcat), and evaluate coarsely.

                             Evaluation
                             Coarse    Fine
Concat (coarse grammar)      13.2      n/a
MConcat (fine grammar)        9.6      †19.6

Table 5.5: Effect of granularity choices in grammar design and evaluation when measuring
segmentation quality on data set Qu'. These results are for the strictly concatenative adaptor
grammars. See §5.8.3.3 for details on “coarse” versus “fine” evaluation. †19.6 is highlighted
as being the combination presented in Table 5.4.

The outcome of this two-way comparison, using only the concatenative baseline
adaptor grammars for efficiency, is that all three valid combinations give low
SBF scores (Table 5.5). We thus reject the hypothesis that these methodological
choices by themselves explain the below-baseline performance of the adaptor
grammars on Qu'.

The second hypothesis we considered is that performance suffered because of
a lack of hierarchical adaptation in the multi-affix adaptor grammars.¹⁷ From
Table 5.3, the fundamental word-forming rule for the multi-affix grammars is

$$\text{Word} \to \text{Pres}\ \text{Stem}\ \text{Sufs},$$

where Pres and Sufs are not adapted (although Pre and Suf were adapted). We added
adaptation to these categories, also introducing intermediate dummy rules to avoid
ill-defined recursive adaptor grammar rules like $\text{Pres} \to \text{Pres}\ \text{Pre}$. The segmentations
induced by the resulting hierarchically adapted, concatenative grammar obtain
an SBF of 22.2, marginally above the MConcat score of 19.6. The initial lack of
hierarchical adaptation therefore affected segmentation quality adversely, but not
sufficiently to fully explain the low segmentation quality on the Quran data set.¹⁸

Given the low segmentation quality on the Quran, it is not included in the
subsequent tasks.

17 Sirts and Goldwater (2013) demonstrated that hierarchical adaptation can play a crucial role.

18 The adaptation of affix non-terminals is nonetheless crucial. In preliminary experiments on Bw,
segmentation quality dropped by about 50% when using unadapted non-terminal categories Pre and
Suf.


5.8.4 Task 2: Root identification 

The next evaluation task is that of obtaining the root morpheme, if any, given a
surface word type as input. We describe two methods for root identification, followed
by details of the baselines we compare to.

5.8.4.1 Root extraction method 


Our root-templatic adaptor grammars are fundamentally unsupervised. The only 
built-in bias they have for capturing roots under the R non-terminals instead of the 
T non-terminals is that we only adapted the former, in particular R3 and R4, as set 
out in 


Table 5.3 Root extraction Method A thus hypothesises as root morpheme 


the characters governed by R3 or R4 in the parse tree, depending on whether the 
evaluation is on triliterals or quadriliterals. A shortcoming of this method is that it 
would miss cases where the root happens to be governed by a T non-terminal, which 
is also licensed. The more lenient extraction Method B thus considers both types 
of non-terminals, possibly hypothesising two root candidates for a word. As with 


the segmentation task, we use the MAP parse tree t(w) (cf. Equation 5.21) as input 
to these root extraction methods^ |Figure 53 provides an example of these root 
extraction methods, and illustrates the evaluation calculation we define below. 

Given an input word $w$, we denote the set of $k$-literal roots hypothesised as
$h_k(t(w)) \in \{\emptyset, \{\hat{r}\}, \{\hat{r}_1, \hat{r}_2\}\}$. We focus on $k \in \{3, 4\}$.

Similarly, we write the set of correct roots of the word $w$ as $g_k(w) = \{r\}$. Roots
that contain weak radicals are filtered out, since our models deal purely with
surface-realised phenomena and thus cannot express such roots. Owing to this filtering and
the fact that some word forms legitimately do not derive from a root, $g_k(w)$ can be
empty. Our evaluation skips such words.


19 The posterior distribution over roots conditioned on a word form, as estimated from the collected
samples, is sharply peaked and in most cases uni-modal, so that a MAP estimate is reasonable.

Let $w$ = abcdef and the inferred parse tree $t(w)$ =

    Stem abcdef
      R3 a·c·d
      T3 b·e·f

                                  Method A       Method B
  $h_3(t(w))$                     {acd}          {acd, bef}
  $h_4(t(w))$                     $\emptyset$    $\emptyset$

  Suppose $g_3(w)$ = {acd}.
    $h_3 \cap g_3$                {acd}          {acd}
    exact_match_3($w$)            1              1

  Suppose $g_3(w)$ = {bef}.
    $h_3 \cap g_3$                $\emptyset$    {bef}
    exact_match_3($w$)            0              1

Figure 5.3: Illustration of the root extraction methods and scoring described in §5.8.4.1. For
brevity, we present a single learning outcome for the word $w$ (above table) and consider two
alternative gold standard references $g_3(w)$ (bottom two rows).


The evaluation metric is exact match accuracy. A hypothesised root is counted
as a match if it appears among the word’s gold roots. When multiple roots are
hypothesised we check if any one matches a gold root. To be precise, define the
exact match score between hypothesised and gold standard roots for a word $w$ as

$$\mathrm{exact\_match}_k(w) =
\begin{cases}
1 & \text{if } h_k(t(w)) \cap g_k(w) \neq \emptyset \\
0 & \text{otherwise.}
\end{cases} \qquad (5.22)$$ ²⁰

The accuracy across the $N'$ filtered test word types is
$\frac{1}{N'} \sum_{i=1}^{N'} \mathrm{exact\_match}_k(w_i)$.
This metric is therefore recall-based. A precision-based evaluation would be
overly harsh, since our root identification methods posit a root for most words. That
is, the templatic adaptor grammars tend to rely heavily on the complex stem-forming
rules, despite the presence of the simple rule, Stem → Char+.
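
The scoring just defined can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `exact_match` and `accuracy` and the dictionary-based data layout are hypothetical, and the toy word is the one used in Figure 5.3, with an extra word that has no gold root.

```python
from typing import Dict, Set


def exact_match(hypothesised: Set[str], gold: Set[str]) -> int:
    """Return 1 if any hypothesised root is among the gold roots, else 0."""
    return 1 if hypothesised & gold else 0


def accuracy(hyps: Dict[str, Set[str]], golds: Dict[str, Set[str]]) -> float:
    """Mean exact match over the word types whose gold root set is non-empty."""
    evaluated = [w for w, g in golds.items() if g]  # skip words without a gold root
    if not evaluated:
        raise ValueError("no word type has a gold root")
    hits = sum(exact_match(hyps.get(w, set()), golds[w]) for w in evaluated)
    return hits / len(evaluated)


# Toy data following Figure 5.3: Method A reads off R3 only, Method B reads R3 and T3.
method_a = {"abcdef": {"acd"}}
method_b = {"abcdef": {"acd", "bef"}}
gold = {"abcdef": {"bef"}, "xyz": set()}  # "xyz" has no gold root and is skipped

print(accuracy(method_a, gold), accuracy(method_b, gold))
```

Because a word counts as correct when any hypothesised root matches, positing two candidates (as Method B may do) can only raise this score, which is the sense in which the metric is recall-based.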


5.8.4.2 Baselines 


20 These definitions are aimed at precise exposition, belying the fact that the ambiguity observed
in our evaluation is very low; the vast majority of words have a single gold root, if any, and for
hypothesised roots it is only Method B that occasionally posits more than one candidate.

We devised four straightforward randomised baseline methods against which to compare
our model-based prediction accuracies. They all involve incremental deletion of characters
from a word, chosen uniformly at random, until a string of k characters remains. The
uniform guesser baseline applies this reduction method directly to the surface word
form, while three further baselines operate on a constrained guessing space:

Stem-constrained: The gold standard stem of a word is taken as input, thereby 
establishing the difficulty of identifying a root under a perfect segmentation 
model. 

Character-constrained: Starts with the surface word form, but strips out
characters never appearing as roots in the data set. This baseline thus has external
knowledge not encoded in our models, but has no sense of word structure.

Doubly-constrained: Uses both aforementioned constraints. This would often leave
little ambiguity as to the identity of the root, and could be regarded as a
benchmark rather than baseline.

Given the non-determinism of these baselines, we report the average of five
repetitions of each.
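
The baselines described above can be sketched as follows. This is a minimal illustration, not the implementation used for the reported numbers: the function names, the optional `stem` and `root_chars` arguments, and the choice to return `None` when fewer than k characters are available are all assumptions.

```python
import random
from typing import Optional, Set


def random_reduce(chars: str, k: int, rng: random.Random) -> Optional[str]:
    """Delete uniformly chosen characters until k remain (None if fewer than k)."""
    if len(chars) < k:
        return None
    remaining = list(chars)
    while len(remaining) > k:
        del remaining[rng.randrange(len(remaining))]
    return "".join(remaining)


def guess_root(word: str, k: int, rng: random.Random,
               stem: Optional[str] = None,
               root_chars: Optional[Set[str]] = None) -> Optional[str]:
    """Uniform guesser by default; pass a gold stem and/or the set of characters
    attested in roots to obtain the constrained variants."""
    candidate = stem if stem is not None else word          # stem constraint
    if root_chars is not None:                              # character constraint
        candidate = "".join(c for c in candidate if c in root_chars)
    return random_reduce(candidate, k, rng)


rng = random.Random(0)
print(guess_root("abcdef", 3, rng))                                   # uniform guesser
print(guess_root("abcdef", 3, rng, stem="bcdef"))                     # stem-constrained
print(guess_root("abcdef", 3, rng, root_chars=set("acde")))           # character-constrained
print(guess_root("abcdef", 3, rng, stem="bcdef", root_chars=set("cde")))  # doubly constrained
```

In the last call the two constraints leave exactly three characters, so the guess is deterministic, which illustrates why the doubly constrained variant behaves more like a benchmark than a baseline.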


5.8.4.3 Results 


The results for root identification are given in Table 5.6 and Table 5.7.


The performance of the baseline methods confirms the intuition that root
identification is harder in the presence of vocalisation.²¹ It is consequently also for
vocalised Arabic (Bw') that identification Method A performs poorly. The more
liberal Method B succeeds in correctly identifying the triliteral root for up to one
in five word types in Bw' that meet the evaluation conditions specified in §5.8.4.1.
This is still low accuracy, but exceeds the character-constrained baseline. Since
that baseline gauges the extent to which vocalisation complicates root identification,
we conclude from this outcome that our models go beyond merely overcoming the
presence of vocalisation. The difference is explained by the model’s segmentational


21 Hebrew morphology and orthography are less complicated than Arabic (Daya et al. 2008), which
is why the baseline accuracies for the vocalised Heb are generally higher than for Bw'.

                                   Bw'       Bw      Heb
Random baselines
  Uniform guesser                  1.5       8.6      3.7
  Stem-constrained                 6.6      43.6     14.2
  Character-constrained           10.3      10.3     30.3
  Doubly constrained              41.7      45.2     79.9
Method A: Extract R3
  Tpl                              0.8      66.6     72.9
  Tpl3Ch                           0.7      64.5      9.2
  Tpl+T4                           0.7      67.1     74.2
  TplR4                            0.9      64.0     69.5
Method B: Extract R3 and T3
  Tpl                             17.2      66.9     72.9
  Tpl3Ch                           0.8      64.5      9.2
  Tpl+T4                          19.8      67.4     74.7
  TplR4                           13.9      64.1     69.5
Evaluated (N')                   27783     27494     2738

Table 5.6: Triliteral root identification accuracy. Exact matches of extracted and gold roots
expressed as a percentage of the number of word types N' for which the true analysis
contains a strong triliteral root.



                                   Bw'       Bw      Heb
Random baselines
  Uniform guesser                  0.7       8.4      2.1
  Stem-constrained                 2.7      40.7      3.2
  Character-constrained           10.5      10.8     15.8
  Doubly constrained              39.4      40.6     66.3
TplR4
  Method A: Extract R4             0.1      10.4      0.0
  Method B: Extract R4 and T4     14.1      10.4      0.0
Evaluated (N')                    2295      2276       19

Table 5.7: Quadriliteral root identification accuracy. Exact matches of extracted and gold
roots expressed as a percentage of the number of word types N' for which the true analysis
contains a strong quadriliteral root.


component. Nonetheless, the regularities leveraged in the process do not necessarily 
coincide with the linguistically correct roots. 

On Hebrew and unvocalised Arabic, there is no meaningful difference between
Methods A and B.

For triliteral root identification, the templatic grammar Tpl+T4 that includes
four-character templates performs best, with accuracies of 74% on Hebrew and 67%
on unvocalised Arabic. In the latter data set there are about 9000 words for which
the stem and root are equivalent; excluding those words from the evaluation
decreases the accuracy from 67% to 57%. This is still substantially above the
competitive Doubly constrained baseline. The fact that Tpl+T4 outperforms Tpl3Ch
shows that, for these two data sets, it is more important to capture templatic
structures of the form VCVCVCV than to handle forms like CVVCVC. On Hebrew and
unvocalised Arabic, this adaptor grammar jointly solves the task of segmentation
and root identification to a fairly high standard.

Our model capable of quadriliteral root identification, TplR4, obtained
accuracies in the region of the Character-constrained baseline for the two Arabic data sets.
Closer analysis shows that this low accuracy is not due to worse segmentation
performance on those words containing quadriliteral roots. Instead, we find that even
if the model segments a word correctly, it often accounts for the stem sufficiently
using the R3 or T3 categories, so that our method extracts no quadriliteral root. For
cases where a quadriliteral root is hypothesised by our extraction methods, we find
that they often match the gold root partially. In this sense, the exact match metric is
relatively harsher than for evaluating triliteral root extraction.

In summary, these results show that a given instantiation of our root-templatic
adaptor grammars can correctly identify triliteral and quadriliteral roots of a small
portion of words, but that more refined grammar engineering would be needed to
increase the accuracy. Example analyses are provided in Figure 5.4 on p. 122.


5.8.5 Task 3: Morpheme lexicon induction 

The previous two tasks considered the challenging setting of predicting a single
correct segmentation and root for a given word type. In this section, we turn to another
dimension of morphology learning by evaluating the morpheme lexica inferred by
the adaptor grammars. By morpheme lexica we mean the sets of strings identified by
a model as being members of a given morphological category.²² The ability to
derive such lexica from unannotated data can be useful for the construction of linguistic
resources in low-resource or undocumented languages, and is relevant in
computational studies of language acquisition (de Marcken 1996; Goldwater 2007).

[Parse tree diagrams: Concat parse (left) and TplR4 parse (right).]

(b) Vocalised Arabic input word “lidanotillA”. Reference analysis: li^danotill A

Figure 5.4: Parse tree examples for a word from the Arabic data sets Bw and Bw' where the
concatenative adaptor grammar segmented incorrectly (left).
The templatic grammars (right) correctly identified the triliteral and quadriliteral roots, and
fixed the segmentation of (a). In (b), the templatic grammar improved over the baseline by
finding the correct prefix but falsely posited a suffix.
(Unimportant subtrees are elided for space, while the yields of discontiguous constituents
are indicated next to their symbols, with dots marking gaps. Crossing branches are not
drawn but should be inferrable. Root characters are bold-faced in the reference analyses.
The non-terminal X2 in (a) is part of a number of implementation-specific helper rules that
ensure the appropriate handling of partly contiguous roots.)

Purely concatenative models of morphology can only infer lexica of prefixes,
suffixes and stems. Our approach to modelling explicitly the transfixal formation
of Semitic stems thus has two potential benefits: it can improve the quality of
the aforementioned lexica, and it also enables the acquisition of the lexicon of root
morphemes. The aim here is to test the extent to which these two potential benefits
are borne out by the SRCG-based adaptor grammars on our Hebrew and Arabic data
sets.

We evaluate two methods of obtaining morpheme lexica from the adaptor
grammar posterior samples:

1. use the MAP parse of each word type as before (§5.8.3.1), but aggregate
morphemes across word types;

2. marginalise directly over both word types and posterior samples to induce
probabilistic lexica.


5.8.5.1 MAP-based morpheme lexica 

We take the MAP parse trees under a given adaptor grammar as starting point, and
predict the lexicon of a particular morphological category (stem, affix, root, etc.)
as the unique set of strings appearing under that category across the word types in
the data set.²³ Prefixes, stems and suffixes are identified using the same method of
traversing the parse trees as in the segmentation task, and we discuss the root lexicon
shortly.

The evaluation metric is the F-score when comparing a hypothesised lexicon 
against the corresponding gold lexicon. 
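
As a small illustration of this set-based comparison, the sketch below computes precision, recall and F-score for a hypothesised lexicon against a gold lexicon. The function names and the toy lexica are hypothetical; the final line only checks that the harmonic mean of the precision and recall reported for Bw in Table 5.9 agrees with the F-score reported there.

```python
from typing import Set, Tuple


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean); defined as 0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def lexicon_prf(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall and F-score of a predicted lexicon against a gold lexicon."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Toy lexica (hypothetical strings, for illustration only).
predicted_roots = {"ktb", "drs", "xyz", "qqq"}
gold_roots = {"ktb", "drs", "lbS"}
print(lexicon_prf(predicted_roots, gold_roots))

# Consistency check against the Bw row of Table 5.9 (P = 51.6, R = 80.0).
print(round(100 * f_score(0.516, 0.800), 1))
```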

22 For simplicity, we use the term “string” in this section to mean both contiguous character
sequences as well as non-contiguous tuples of characters.

23 A given string can therefore appear in multiple lexica, which fits the linguistic reality.

The primary result is that, consistent with our expectations, the models capable
of forming complex stems obtain a marked improvement in prefix, stem and
suffix lexicon F-score compared to the baseline concatenative adaptor grammar
(Table 5.8). The amount of improvement grows with the expressivity of the templatic
grammar, most pronouncedly for the vocalised Arabic data set.

For unvocalised Arabic, the templatic grammars improve over the baseline by
recovering more of the correct stems: recall goes from 32% under Concat to 47%
under TplR4 while precision remains at 74%. This equates to an additional 3673
correctly inferred stems. On vocalised Arabic and Hebrew, the stem F-score
improvement is much more substantial (from roughly 20% to roughly 60%) and arises
from increases in both precision and recall; precision reaches about 70% and recall
54% for these two vocalised data sets when using TplR4. This grammar thus induces
stem lexica at quite high precision for three different data sets.

In contrast to stems, the affix lexica are very noisy. The concatenative models 
tend to over-generate to obtain very high recall but low precision. The templatic 
grammars balance out that trade-off to some extent. 

To obtain a root lexicon based on the MAP derivations for each data set we
leveraged the outcome of the earlier root identification experiment (§5.8.4.3). For
Bw and Heb, we take the R3 yields, while for Bw' we take the T3 yields. Variation
among the different templatic grammars was negligible, so we report the results only
for Tpl in Table 5.9. The lexica of roots identified for the two Arabic data sets in
this fashion contained just over 50% legitimate roots. The high recall of 80% on the
unvocalised version equates to recovering 2602 of the 3254 surface-realised triliteral
roots present in that data set.


5.8.5.2 Probabilistic morpheme lexica 

The preceding set-based evaluation of the morpheme lexica imposes hard decisions
about category membership and does not make use of the fact that the adaptor
grammars induce a probability distribution over strings. In this section we measure
morpheme lexicon induction performance by considering the probability with which a
string falls under a particular non-terminal category in an adaptor grammar.

                 Bw'                     Bw                      Heb
           Pre   Stem    Suf       Pre   Stem    Suf       Pre   Stem    Suf
Concat    15.0   20.2   25.4      32.8   44.1   40.3      18.7   20.9   29.2
Tpl       24.7   39.4   35.2      45.9   54.7   47.9      35.1   59.6   52.9
TplR4     37.8   60.3   47.0      53.0   57.7   51.9      38.0   62.4   55.2

Table 5.8: Morpheme lexicon induction quality. F1-scores for lexicons induced from the
maximum a posteriori parse of each different dataset.

                P       R       F
Bw' (T3)      54.9    47.9    51.2
Bw (R3)       51.6    80.0    62.7
Heb (R3)      25.3    55.3    34.8

Table 5.9: Precision, recall and F-score for root lexicon induction using the grammar Tpl.

From our posterior samples we can estimate the joint probability of a string $s$
appearing under grammar category $C$ by summing over all word types and posterior
samples:

$$P(s, C) \propto \sum_{j=1}^{J} \sum_{i=1}^{N}
\begin{cases}
1 & \text{if non-terminal } C \text{ dominates } s \text{ in analysis } t_j^{(i)} \\
0 & \text{otherwise.}
\end{cases}$$
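
The counting behind this estimate can be sketched as follows. The data layout (each analysis flattened to a list of (category, string) pairs), the function names and the toy samples are assumptions made for illustration, as is the choice to normalise the counts over all (category, string) pairs.

```python
from collections import Counter
from typing import Dict, List, Tuple

# One analysis = the (category, string) pairs found in the parse tree of one word type.
Analysis = List[Tuple[str, str]]


def joint_distribution(samples: List[List[Analysis]]) -> Dict[Tuple[str, str], float]:
    """Relative frequency of each (category, string) pair over samples and word types."""
    counts: Counter = Counter()
    for sample in samples:              # j = 1..J posterior samples
        for analysis in sample:         # i = 1..N word types
            counts.update(set(analysis))  # indicator: count a pair once per analysis
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}


def ranked(joint: Dict[Tuple[str, str], float], category: str) -> List[Tuple[str, float]]:
    """Strings of one category, sorted by decreasing joint probability."""
    items = [(s, p) for (c, s), p in joint.items() if c == category]
    return sorted(items, key=lambda sp: (-sp[1], sp[0]))


# Two hypothetical posterior samples over three word types.
samples = [
    [[("R3", "lbS"), ("T3", "aa")], [("R3", "lbS"), ("T3", "ie")], [("R3", "spr"), ("T3", "aa")]],
    [[("R3", "lbS"), ("T3", "aa")], [("R3", "tlb"), ("T3", "ie")], [("R3", "spr"), ("T3", "aa")]],
]
print(ranked(joint_distribution(samples), "R3"))
```

A string that is analysed as a root in several word types and in most samples accumulates a high count, which is how shared discontiguous morphemes rise to the top of the ranking.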


We can thus produce a list of roots ranked by their joint probability values under a
given adaptor grammar. As an example, we present in Table 5.10 the top five
triliteral roots obtained from the Hebrew data set in this fashion. The first four of these
correspond to correct Hebrew roots, demonstrating that the SRCG-based adaptor
grammars can learn linguistically relevant, discontiguous entities from unannotated
data. In the same table, we present some example usages of the roots. As a concrete
example of how the model successfully encodes the original intuition that
discontiguous morphemes can be shared across word forms, consider that it correctly
associates the root lbS with such orthographically distinct word forms as labaSt, tilbeSi
and lehalbiS, which are not linked by a shared stem.

ROC analysis. A snapshot of the top-ranked items does not quantify the quality
of a morpheme lexicon very broadly. But it illustrates a simple sense in which we
can combine a decision rule with an adaptor grammar to obtain a discrete classifier
of strings. Instead of the decision rule “top n items from ranked list”, we can also
base the binary decision of whether a particular string $s$ is part of a morphological
category $C$ directly on the probability $P(s|C)$ using a threshold $\tau$ between 0 and 1.
Let a predicted lexicon $L_\tau$ be the set of strings for which $P(s|C) > \tau$.


This allows using receiver operating characteristics (ROC) analysis (Egan 1975;
Fawcett 2006) to characterise the trade-off between true positive and false positive
predictions across different decision thresholds $\tau$. A ROC plot is obtained by
plotting the true positive rate (e.g. the proportion of true stems recovered) against
the false positive rate (e.g. the proportion of non-stems admitted) for $L_\tau$ as
$\tau$ is varied from 0 to 1. A coin-tossing random baseline
classifier traces a diagonal line ($y = x$) through the origin. Better classifiers give
rise to points in ROC space toward the upper left of the plot. Finally, the area under
a curve (AUC) on the ROC plot gives a summary of the general predictive
performance of a model. Because the ROC space is the unit square, AUC is a number
between 0 and 1. AUC is equivalent to the probability that a model would rank a
randomly chosen positive instance higher than a randomly chosen negative instance
(Fawcett 2006). ROC plots and AUC are non-parametric methods, and therefore
give robust results insensitive to internal differences between models or skews in the
number of positive versus negative instances.
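
The following sketch shows one way to trace a ROC curve from scored strings and to integrate it with the trapezium rule. It is illustrative only: the function names are hypothetical, the scores are invented, and the handling of tied scores (moving along both axes at once) is an assumption rather than a description of the evaluation code used here.

```python
from typing import Dict, List, Set, Tuple


def roc_points(scores: Dict[str, float], positives: Set[str]) -> List[Tuple[float, float]]:
    """(false positive rate, true positive rate) points as the threshold is lowered."""
    n_pos = sum(1 for s in scores if s in positives)
    n_neg = len(scores) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative string")
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][1] == ranked[i][1]:  # admit ties together
            if ranked[j][0] in positives:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc_trapezium(points: List[Tuple[float, float]]) -> float:
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores for five candidate root strings; three of them are true roots.
scores = {"spr": 0.9, "lbS": 0.8, "!al": 0.7, "ptx": 0.6, "xyz": 0.1}
true_roots = {"spr", "lbS", "ptx"}
print(auc_trapezium(roc_points(scores, true_roots)))
```

In this toy case five of the six (true root, non-root) pairs are ranked in the right order, and the trapezium-rule area is likewise 5/6, which illustrates the ranking interpretation of AUC given above.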


Figure 5.5 shows the ROC plots and AUC values for inducing stem and root
lexica from our data sets using different adaptor grammars. The first main outcome
is that the prediction of stems benefits from the use of the root-templatic SRCG rules,
consistent with the results in the MAP-based evaluation. This effect is again most
pronounced for the vocalised data sets Bw' and Heb. The second main result is that
the prediction of root lexica for Heb and Bw is substantially better than the random
baseline: the SRCG-based models would rank a true root above a non-root string
with probability 0.77 and 0.82 on these respective data sets. For the vocalised Bw',
the corresponding value is 0.68. As a means for predicting whether or not particular
strings are root morphemes of the language, our method thus performs better than
might be expected from the low root-identification results in the per-word analysis
in §5.8.4.3.


                                                        Mistaken Examples
P(s, R3)   Root s   Correct Examples                    Bad Segmentation     Bad Root
0.00649    spr ✓    sapar^ti       ye^sapr^u            megisapr^im          hi®stap®a r t t
0.00527    lbS ✓    labaS^t        tiJbeSJ              le-i-ha^lbiS
0.00524    ptx ✓    patax^ti       t^ptexj              n^ptax^at            li^p t oax
0.00521    rxc ✓    yi^rxac        roxec^et             lc|hit?,raxcc
0.00510    !al ✗                                        le-|-ha 0 !al®ot     ! c 1 .an-it

Table 5.10: Top five Hebrew triliteral roots hypothesised by the grammar Tpl+T4 as ranked
by model probability, along with examples of their occurrence in the analysed words. The
first four are legitimate roots, while the fifth is not a valid root.
The middle two columns give examples of words where the model found the correct
segmentation points and the predicted root characters coincide with the true root
characters (bold). The penultimate column shows examples where the root is found despite
segmentation mistakes, marking false positive (®) and false negative (j) segmentation points.
Examples in the final column feature segmentation and root mistakes: false positive root
characters appear in grey, while false negatives are boxed.

[ROC plots: panels for Hebrew, Arabic and vocalised Arabic, each with Stems (left) and
Roots (right); horizontal axis: False Positive Rate.]

Figure 5.5: ROC curves for predicting the stem lexicon (left) and root lexicon (right) for the
Hebrew and Arabic data sets, as described in §5.8.5.2. The area under each curve (AUC) is
given in parentheses, as computed with the trapezium rule.

5.9 Summary


This chapter introduced a new approach to the unsupervised learning of
non-concatenative morphology. We showed how the simple Range Concatenating Grammar
formalism can be used to encode non-concatenative morphological processes which
do not involve character deletion or insertion when composing morphemes into a
surface word form. This approach can account for infixation and circumfixation, but
our focus was to instantiate and evaluate the general idea on the root-template-based
morphology employed in Semitic languages. For a learning framework, we made
the natural extension of non-parametric Bayesian adaptor grammars to the SRCG
formalism, which makes it possible to model explicitly the tendency of morphemes
(contiguous and discontiguous ones) to be reused in different words. Aside from
our application to morphology, this learning approach could also be useful in other
applications of SRCGs.

Our experimentation on Hebrew, Quranic Arabic and two versions of a synthetic
data set containing morphotactically valid standard Arabic words shows that
modelling discontiguous sub-structures of words improves the segmentation of words
into morphemes. A simple root identification method correctly identified triliteral
roots for about two-thirds of the words where they are expected, for two of the
aforementioned data sets. On vocalised Arabic, absolute performance across the
different tasks was generally lower, and very low for the Quran. Finally, we found
that our SRCG-based models could be used to infer a lexicon of triliteral root
morphemes. This mixed outcome flows from the fact that we deliberately sought to
limit the amount of language-specific knowledge built into the grammars in order
to test to what extent a high-level notion of intercalated morphology is beneficial in
these kinds of unsupervised models. For the Hebrew and synthetic Arabic data sets,
this was balanced against exercising more control over the word shapes we included
in the evaluation, whereas for the Quran we applied no informed preprocessing or
standardisation whatsoever.

Chapter 6
Conclusion


This dissertation presented original work on the problem of capturing sub-word
structure in probabilistic models of language. The central approach was to use the
essence of certain morphological processes as a basis for extending three different
types of models in order to make them better suited to morphologically rich
languages. We applied this approach both to language models (LMs) which define
probabilities of word sequences, and to unsupervised models that seek to induce the
morphemes of individual words from unannotated text.

In Chapter 3 we presented a hierarchical Bayesian model of word sequences
which accounts for productive compound formation. We exploited the
right-headedness of German compounds to structure the statistical dependencies between a
compound word and its immediately preceding context, while using an embedded
language model to generate the modifiers within each compound independently of
the word’s context. The main findings from our evaluation were as follows:


1. The compounding language model succeeded in reducing the data sparsity
due to compounding by expressing compounds in terms of their parts: in our
data set, the number of distinct elements in the base distribution at the root of
the hierarchical model is reduced to less than half the number of unsegmented
word types. The model obtained small improvements in test set perplexity
compared to a baseline language model that views compounds as atomic
elements.







2. The use of the compounding language model as a feature in a machine
translation system led to translations that were judged to be equivalent in overall
quality to those obtained with a baseline language model, as measured with Bleu.
Finer-grained measurement showed that the compounding model did,
however, improve the translation system’s precision in outputting compound words.

3. Structuring the statistical dependencies according to the right-headedness of
German compounds played a key role in the aforementioned outcomes. This
was established by observing a degradation in perplexity and Bleu when
using a model variant that posits left-headedness.


In Chapter 4 we shifted the attention to another type of language model to
consider how sub-word structure could be leveraged in another setting. Based on
automatically induced distributed feature representations of words and smooth scoring
functions, distributed language models (DLMs) circumvent the need for smoothing
by means of back-off as employed in traditional n-gram LMs, and as we refined
further in the compounding model. Our contribution in this regard is two-fold: we
show that DLMs can benefit substantially from explicitly modelling sub-word
structure, and we determine that it is possible and beneficial to use a normalised DLM as
part of a machine translation decoder.


We formulated a variant of the log bilinear (LBL) language model (Mnih and
Teh, 2012) that incorporates a simple method of composing word vectors from
morpheme vectors. Morpheme vectors are learnt as part of the model, leveraging a
separate unsupervised model to obtain the morphological segmentation itself. The
method is inspired directly by recent positive results in modelling compositional
morphology (Luong et al. 2013; Lazaridou et al. 2013) and the observation that
word-based DLMs already capture non-trivial, morphologically relevant regularities
(Mikolov et al. 2013b). The main findings from our monolingual evaluation are as
follows:


1. The morpheme-based LBL achieved substantial reductions in test set
perplexity for six languages of varying morphological complexity. The model
effects a soft tying of parameters across words that share morphemes, which
contributed to the largest perplexity improvements being generally on rarer
words. This is an important outcome for modelling morphologically richer
languages, where rare words tend to form a larger portion of the vocabulary.

2. The availability of morpheme vectors allowed for the composition of
high-quality word vectors for out-of-vocabulary words, as measured in a word
similarity rating task on multiple individual languages.
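
To make the idea of composing word vectors from morpheme vectors concrete, the sketch below sums morpheme vectors element-wise, which is the additive composition referred to in this chapter. The function name, the two-dimensional vectors and the example segmentation are hypothetical; in the model itself the morpheme vectors are learnt and the segmentation comes from a separate unsupervised model.

```python
from typing import Dict, List, Sequence


def compose_word_vector(morphemes: Sequence[str],
                        morpheme_vectors: Dict[str, List[float]]) -> List[float]:
    """Element-wise sum of the vectors of a word's morphemes."""
    dim = len(next(iter(morpheme_vectors.values())))
    word_vector = [0.0] * dim
    for morpheme in morphemes:
        for i, value in enumerate(morpheme_vectors[morpheme]):
            word_vector[i] += value
    return word_vector


# Hypothetical learnt morpheme vectors (two dimensions for readability).
morpheme_vectors = {
    "un": [0.5, -1.0],
    "break": [1.0, 2.0],
    "able": [0.25, 0.0],
}

# A word never seen in training still receives a vector through its parts.
print(compose_word_vector(["un", "break", "able"], morpheme_vectors))
```

Because every word containing a given morpheme reuses the same morpheme vector, an update driven by one word also moves the representations of its morphological relatives, which is the soft parameter tying noted in the first finding.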

The integration of correctly normalised LBLs into a machine translation decoder
was made tractable by partitioning the vocabulary into word classes and
decomposing the probability model accordingly. This is a known method of decreasing the
computational cost involved in computing probabilities with DLMs. To our
knowledge, this work is however the first to deploy it successfully in translation; previous
uses of DLMs in translation have been limited to rescoring n-best lists or lattices, or
have used unnormalised DLMs. Our evaluation findings were as follows:

1. Use of a LBL LM feature in the log-linear translation model (in addition to a
modified Kneser-Ney (MKN) LM feature) provided consistent improvements
in Bleu when translating into six languages of varying morphological
complexity, compared to a baseline system using only the MKN LM.

2. The morpheme-based LBL provided further small increases in Bleu, though
the improvements cannot be separated entirely from the instability of the
optimiser used to tune the translation model feature weights.
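
The class-based decomposition mentioned before these findings can be sketched as a product of two smaller normalised distributions, P(w | h) = P(c(w) | h) · P(w | c(w), h). The code below is a generic illustration of that factorisation, not the decoder integration itself: the function names, the class assignment and the scores are hypothetical, and in a real model the scores would be computed from the context h.

```python
import math
from typing import Dict


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalise arbitrary real-valued scores into a probability distribution."""
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}  # shift for stability
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


def class_factored_prob(word: str,
                        word_class: Dict[str, str],
                        class_scores: Dict[str, float],
                        word_scores: Dict[str, float]) -> float:
    """P(word) = P(class of word) * P(word | class), each normalised separately."""
    cls = word_class[word]
    p_class = softmax(class_scores)[cls]
    members = {w: s for w, s in word_scores.items() if word_class[w] == cls}
    p_word_given_class = softmax(members)[word]
    return p_class * p_word_given_class


# Hypothetical vocabulary partition and context-dependent scores.
word_class = {"haus": "c0", "hund": "c0", "geht": "c1"}
class_scores = {"c0": 1.0, "c1": 0.0}
word_scores = {"haus": 0.5, "hund": -0.5, "geht": 2.0}

total = sum(class_factored_prob(w, word_class, class_scores, word_scores) for w in word_class)
print(total)
```

Each probability now requires normalising over the classes and over the members of one class only, rather than over the whole vocabulary, while the result remains a properly normalised distribution.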

As a third and final instance of capturing sub-word structure in probabilistic
models of language, we switched perspective to the question of acquiring such
sub-word structure in the first place, rather than relying on an external source as we
did in the sequence-based models. Our contribution in this regard is to suggest that
simple Range Concatenating Grammars (SRCGs) are useful for encoding both
concatenative and non-concatenative morphological processes in a single formalism.
The grammar-based approach allows for unsupervised learning to be done using
the non-parametric Bayesian framework of adaptor grammars, which we instantiate
for the SRCG formalism. We applied this approach to Semitic morphology, where
words are often formed by first interspersing the characters of root morphemes into
templatic structures and then attaching further affixes in a concatenative fashion. As
such, it offers a useful test-bed for the notion that joint modelling of concatenative
and non-concatenative morphological processes could have a mutually
disambiguating effect in morphological analysis. Our evaluation gave rise to the following main
findings:

1. Modelling discontiguous sub-word structure improved the segmentation of 
Arabic and Hebrew words into linear morphemes, as measured by the F-score 
in comparison to baseline grammars that ignore discontiguous sub-structure. 
This holds for both orthographic variants of our Arabic data set (with and 
without vocalisation). 

2. A simple method for extracting root morphemes from the inferred parse trees 
of words produced mixed results. On two of the data sets it correctly identified 
about two-thirds of triliteral roots for the words where such roots are expected, 
but for the other data sets and for the identification of quadriliteral roots in all 
our data sets, the accuracy was low. 

3. The intercalating adaptor grammars also produced drastic F-score increases 
for the lexica of stems induced from data, relative to the purely concatenating 
baseline grammars. This is related to the improved segmentation performance. 

4. Considered as binary classifiers of strings, the intercalating adaptor grammars 
obtained moderately high performance in discriminating between string tuples 
that are triliteral roots and those that are not. 

A characteristic of our unsupervised morphology learning work was that we erred
on the side of underspecification when it came to engineering rules to fit the data
sets and languages used, and this is reflected in the mixed results. Our primary
intention was to test the extent to which a high-level notion of non-concatenative
morphology could reasonably be expressed in adaptor grammars. This paves the
way for interesting future applications, and creates a framework within which it is
straightforward to encode finer-grained knowledge of particular non-concatenative
processes.

In conclusion, this dissertation showed that the explicit inclusion of sub-word
structure in probabilistic language models is beneficial for morphologically
complex languages in settings ranging from unsupervised morphology learning and word
similarity rating to the estimation of word-sequence probabilities and their
application in machine translation.


6.1 Extensions and further research 


Possible directions for further research emanating from the work presented in this 
dissertation include the following: 

1. The LMs developed in Chapters 3 and 4 are in principle capable of
providing probabilities for OOV words by leveraging their sub-word representations
encoded in the models. For a machine translation system to benefit
maximally from that ability, the system itself also requires the ability to
hypothesise novel word forms in its translation output. An interesting direction for
future research is to combine our LMs with the recently proposed
methodology of Chahuneau et al. (2013a), where novel inflections of words can be
hypothesised by the translation system.

2. Both LMs made use of deterministic segmentations of words into sub-parts.
An alternative would be to learn the segmentation jointly with the rest of the
model, e.g. as done by Mochihashi et al. (2009) in the context of Chinese
word segmentation.

3. The method from Chapter 4 for composing word vectors from morpheme vectors 
through addition is readily implementable in bilingual or multilingual 
distributed models (e.g. Zou et al., 2013). Aside from the parameter tying effect 
that should also be useful for rare words in that setting, an interesting question 
is whether a morpheme-based bilingual model can capture the tendency of a 
given language to encode certain grammatical properties with distinct words 
while another language uses bound morphemes for the same. 


4. Our approach to modelling discontiguous sub-word structure in Chapter 5 
presents interesting opportunities for extending the line of work that approaches 
unsupervised morphology learning from the perspective of language acquisition. 
Frank et al. (2013) recently considered the mutually disambiguating roles 
of morphology and syntax in this context, and a similar study on languages 
employing non-concatenative morphology could benefit from our work. 


5. The use of SRCGs in Chapter 5 affords modelling convenience but comes 
at the cost of high polynomial parsing time complexity. As with adaptor 
grammars generally, the emphasis is on being able to explore different model 
assumptions easily. Yet full recursive power is not necessary for many 
morphological processes, so a relevant question is whether finite-state 
transformations or approximations could be used to cast a given adaptor grammar 
into a more efficient computational framework. There is a rich literature on 
this for context-free grammars (e.g. Nederhof 2000). 


6. The burden of defining the appropriate set of rewrite rules for the base 
distribution grammar could be overcome by using a semi-supervised approach that 
searches for optimality over different instantiations of so-called metagrammars 
(Sirts and Goldwater 2013). 
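
The additive composition referred to in item 3 can be illustrated with a minimal 
sketch. The morpheme inventory, the segmentations and the vector values below are 
invented for illustration and are not parameters learned by the model of Chapter 4; 
the function name compose is likewise hypothetical, and the sketch leaves out every 
other detail of that model. It only shows how summing morpheme vectors ties 
parameters across morphologically related words. 

```python
# Illustrative sketch only: the vectors and segmentations are made-up toy
# values, not learned parameters or output from the models in this work.
from typing import Dict, List

Vector = List[float]

# Hypothetical morpheme embeddings (dimension 3).
MORPHEME_VECTORS: Dict[str, Vector] = {
    "un": [0.5, -1.0, 0.0],
    "break": [1.0, 2.0, -0.5],
    "able": [-0.25, 0.5, 1.5],
}


def compose(morphemes: List[str], table: Dict[str, Vector]) -> Vector:
    """Return the element-wise sum of the vectors of the given morphemes."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for morpheme in morphemes:
        for i, value in enumerate(table[morpheme]):
            total[i] += value
    return total


# Two related word forms share the vectors for "break" and "able".
breakable = compose(["break", "able"], MORPHEME_VECTORS)
unbreakable = compose(["un", "break", "able"], MORPHEME_VECTORS)

# The two word vectors differ exactly by the vector of the prefix "un".
difference = [u - b for u, b in zip(unbreakable, breakable)]
```

Because both word forms are built from the same entries for "break" and "able", any 
update to those entries affects both words at once, which is the parameter tying 
effect that would carry over to a bilingual or multilingual setting. 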